Via our sister site, Webmasterworld, I found out how Twitter builds its algorithms with real people to define and refine relevancy for real-time (and, arguably, not-so-real-time) search.
The Twitter Engineering Blog (which I personally find ironic and funny, since it is hosted on Google's Blogger service) explains:
From a search and advertising perspective, however, these sudden events pose several challenges:
- The queries people perform have probably never before been seen, so it's impossible to know without very specific context what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for "horses and bayonets" are interested in the Presidential debates?
- Since these spikes in search queries are so short-lived, there’s only a small window of opportunity to learn what they mean.
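To make that "small window of opportunity" concrete, here is a minimal, hypothetical sketch (not Twitter's actual code; every name and threshold here is an assumption for illustration) of a sliding-window counter that flags a query as spiking when its recent volume jumps well above its historical baseline:

```python
from collections import defaultdict, deque
import time

class SpikeDetector:
    """Flags queries whose recent search volume jumps far above baseline.

    Hypothetical illustration only: a production system (e.g. a Storm
    topology) would compute this over distributed streams, not in-process.
    """

    def __init__(self, window_secs=300, min_count=5, spike_ratio=10.0):
        self.window_secs = window_secs    # how far back "recent" reaches
        self.min_count = min_count        # ignore very rare queries
        self.spike_ratio = spike_ratio    # recent count vs. baseline count
        self.recent = defaultdict(deque)  # query -> timestamps in window
        self.baseline = defaultdict(int)  # query -> aged-out lifetime count

    def observe(self, query, now=None):
        """Record one search for `query`; return True if it is spiking."""
        now = time.time() if now is None else now
        hits = self.recent[query]
        hits.append(now)
        # Hits that fall out of the window become part of the baseline.
        while hits and hits[0] < now - self.window_secs:
            hits.popleft()
            self.baseline[query] += 1
        if len(hits) < self.min_count:
            return False
        # A never-before-seen query has almost no baseline, so even a
        # modest burst (e.g. #bindersfullofwomen) registers as a spike.
        return len(hits) >= self.spike_ratio * max(self.baseline[query], 1)
```

The key design point is exactly the one the quote makes: for a brand-new query the baseline is essentially zero, so novelty itself is the signal, and the short window means the flag fires while the event is still live.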
Twitter's answer is, IM(not so H)O, a superb balance of efficiency, speed and simple common sense, bringing together the wisdom of the crowds to deliver a damn fine answer.
- First, we monitor for which search queries are currently popular. Behind the scenes: we run a Storm topology that tracks statistics on search queries. For example, the query [Big Bird] may suddenly see a spike in searches from the US.
- As soon as we discover a new popular search query, we send it to our human evaluators, who are asked a variety of questions about the query. Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon's Mechanical Turk service, and then polls Mechanical Turk for a response. For example: as soon as we notice "Big Bird" spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant Tweets and ads.
- Finally, after a response from an evaluator is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
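The three steps above form a simple human-in-the-loop pipeline, and it can be sketched in a few lines. To be clear, this is not Twitter's implementation: where they use a Storm topology, a Thrift API and Amazon Mechanical Turk, the sketch below substitutes an in-process queue and a fake evaluator thread, and every function name is hypothetical.

```python
import queue
import threading

# Hypothetical stand-ins for the real components: the queue plays the
# Thrift-dispatched Mechanical Turk HIT pool, the worker thread plays a
# human judge, and the dict plays the backend the ML models read from.

evaluation_tasks = queue.Queue()   # spiking queries awaiting human judgment
label_store = {}                   # "backend": query -> category label

def dispatch_for_evaluation(query):
    """Step 2: send a newly popular query out for human evaluation."""
    evaluation_tasks.put(query)

def fake_evaluator():
    """Stands in for a Mechanical Turk judge answering categorization tasks."""
    while True:
        query = evaluation_tasks.get()
        # A real judge picks from a taxonomy; we hard-code one answer.
        label_store[query] = "politics" if query == "big bird" else "unknown"
        evaluation_tasks.task_done()

def serve_ads(query):
    """Step 3: the next search for the query consults the stored label."""
    if label_store.get(query) == "politics":
        return ["@barackobama", "@mittromney"]  # election-related advertisers
    return []                                   # no targeting signal yet

# Wire it together: spike detected -> dispatch -> judged -> used in serving.
threading.Thread(target=fake_evaluator, daemon=True).start()
dispatch_for_evaluation("big bird")
evaluation_tasks.join()  # "poll" until the human judgment has arrived
```

The point of the pattern is that the expensive, slow part (a human answering a question) happens once per novel query, while every subsequent search is served from the cheap cached label.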
Do you think this is the way forward? Could Twitter take on the behemoth of search this way, delivering a long-term route to market and building a relevancy dataset from real-time data for future, non-real-time use?