Click Streams are Dirty


Lee Odden asks Matt Cutts about clickstreams. Matt made it sound as though generally they had little value:

I’m not going to say definitively that Google doesn’t/won’t use toolbar data (or other signals) in ranking. I think what you were picking up on was my long list of “cons” in data like that.

Regarding cons of using toolbar data, the main reason would be if people were to spoof toolbar data to make a page or domain look more visited than it was. For example, at SES New York, I pointed out that Alexa provides a “Related Links” feature for web sites, and that data had been spammed to show related sites being job sites, poker sites, etc.

This is quite contrary to Mike Grehan's recent article mentioning clickstreams.

Links are good. But you can only get links from other people who have Web sites. What about the millions of end users who don't have sites? The only way they can show a search engine their approval of results' relevancy is by voting with clicks.

I now tend to look at links as a peer group review. If your community thinks your site is the greatest piece of work ever and decides to link to it, you have their vote. End users will decide whether the community was right or wrong.

Is Matt leading us astray, or do you think Google does not use clickstreams at this point? As sophisticated as some of their algorithms are, you would think they would be able to find some signal in toolbar data, wouldn't you?


Both Mike and Matt's Points Are Valid

It can be a great tool, but it's also easily manipulated. That being the case, if it were my engine, I would go out of my way to convince my enemy I would never use it. That way I could use it without the data being tainted...

Yeah if I remember correctly

Yeah if I remember correctly Mike Grehan mentioned user data and/or search patterns (e.g. searches for your website's name) as possible reasons 'legit' sites beat the sandbox-which-doesn't-exist.

And again if I remember correctly Brett Tabke stated in that recent WMW thread, point blank, that Google has been using user data since Florida.

Cutts' points would seem to contradict that sentiment -- almost. But in classic Cutts fashion, he does, of course, leave room for gray area.

Ultimately, I'm going to say they ARE using user data. Of course it is muddied and spammed. Did that ever stop them from analyzing links?

They'll just try to use the less-spammable components of user data... there's a signal in there somewhere.

> I would go out of my way to convince my enemy I would never user it. That way I could use it without the data being tainted


not "exactly"

Some time ago, pre-Toolbar, when everyone was starting to see how links might be ... well, let's just say "managed for competitive gain", they were suggesting that clickstream would be really useful, and that link pattern research would be revealing. Once the Toolbar was out, ole Brett routinely fired it up to bring the spiders to his new sites/pages, and advised others to do the same. Clickstream at work. Brett's an old hand at clickstream manipulation... he's one of the reliable stalwart defenders of Alexa rankings (I think he has a webmaster website, too).

Now I would not be surprised to learn that link pattern analysis turned out to be easy, and clickstream analysis hard. That suggests it is still a challenge for Google to use it, yet they probably implemented parts of it all along where it works. Where quality review suggests it is more harmful than helpful, Matt's CONS list grows and clickstream data use gets throttled. Would Matt know at any given time how much clickstream data is in use? Nah. Would he know there are problems with use of clickstream data? Sure... and he probably knows the signs of over-reliance as reflected in the SERPs.

Does Matt have confidence in the potential for clickstream data to significantly improve SERPs? That is what I take away from this... not a lot, because it can be abused so effectively.

I think I just said the same thing that Boser said, but without suggesting that it might be a great tool for the SE. I suspect it is too fickle right now and not as useful as contextual work, blog work, and link work combined with trust work (sitemaps, analytics, adsense, whois etc).

I think the toolbar can be

I think the toolbar can be combined with all the other forms of tracking to give them some signal of quality. Sure it can be spoofed, but the places where it is being spoofed ... most of those places probably lack the other corroborating quality signals necessary to rank.

Google's Patent

I've created my own little mini guide of this patent from Google.

Information retrieval based on historical data

It's like a road map for Google's algo. Much of what is detailed in the above patent application (and others) maps onto what is being discussed at various fora, blogs, etc.

They have some revealing notations in there about Clickstream data.

While it is not proof of anything, there sure are quite a few coincidental things going on between that patent and what I see on a daily basis with Google and other search engines. Just a reminder, the above patent application was originally filed 2003 December 31, almost two and a half years ago. That gave them plenty of time to fine tune some of that stuff. ;)

It's an interesting read if you don't mind watching paint dry. ;)

Clickstream used for sure. It's easily abuse filtered!

Of course they'd be able to use the data. They can build a spammer site filter, so they can definitely build a clickstream abuse filter.

The filter would use things like:

  • IP uniqueness
  • Presence of a functional referring link that can facilitate traffic flow from one site to another (I'm surprised Alexa doesn't look at that. How could so many people be referred to one page from another when there's no link to it? That should simply be flagged as possible spam)
  • Duration of visit

In addition to Cutts not denying it, it's nearly a sure thing it's used in my book.
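A filter along the lines of those bullets could, as a rough sketch, score each page's recorded visits and flag the suspicious ones. Everything here is hypothetical illustration — the field names and thresholds are my own guesses, not anything Google has disclosed:

```python
from dataclasses import dataclass

@dataclass
class Visit:
    visitor_ip: str               # IP the toolbar hit came from
    referrer_links_to_page: bool  # does the referring page actually link here?
    duration_secs: float          # time spent on the page

def abuse_score(visits):
    """Return a 0..1 'suspicion' score for a page's clickstream.

    Higher = more likely spoofed. Thresholds are illustrative guesses,
    not real search-engine values.
    """
    if not visits:
        return 0.0
    unique_ips = len({v.visitor_ip for v in visits}) / len(visits)
    linked = sum(v.referrer_links_to_page for v in visits) / len(visits)
    short = sum(v.duration_secs < 2.0 for v in visits) / len(visits)
    # Few unique IPs, phantom referrers, and bounce-fast visits all look
    # like scripted traffic rather than real users.
    return ((1 - unique_ips) + (1 - linked) + short) / 3

bot_like = [Visit("10.0.0.1", False, 0.5) for _ in range(100)]
print(abuse_score(bot_like))  # close to 1.0 -> likely spoofed
```

A real system would obviously weight and combine far more signals, but the point stands: each of the three criteria is cheap to compute from raw logs.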

They almost surely use it.

The biggest use of Toolbar clickstream data ought to be to spot when the data a user receives might be significantly different to the data that the spider has received. Spot that consistently over time, and there's a URL ready for an engineer to go check for cloaking.

Try using toolbar data for any kind of random sampling though and it is likely to fail. The user base of the toolbar is just not statistically 'average' enough. About all it would be good for is spotting the type of demographic profiles that install toolbars.

Matt is spot on to say it is 'dirty' data, and over-prone to manipulation. But it would be a very useful tool to spot certain kinds of dirt for just that reason.
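That cross-check is easy to picture. A minimal sketch (my own illustration, with made-up function names): fingerprint what the crawler fetched and what toolbar users saw, then queue URLs where the two never agree:

```python
import hashlib
from collections import defaultdict

# url -> {"spider": set of content hashes, "toolbar": set of content hashes}
seen = defaultdict(lambda: {"spider": set(), "toolbar": set()})

def record(url, source, html):
    """Record a content fingerprint from either the crawler or a toolbar user."""
    digest = hashlib.sha256(html.encode()).hexdigest()
    seen[url][source].add(digest)

def cloaking_suspects():
    """URLs where spider and user fingerprints never overlap -> review queue."""
    return [url for url, s in seen.items()
            if s["spider"] and s["toolbar"]
            and not (s["spider"] & s["toolbar"])]

record("example.com/page", "spider", "<h1>Keyword stuffed page</h1>")
record("example.com/page", "toolbar", "<h1>Totally different page</h1>")
print(cloaking_suspects())  # ['example.com/page']
```

Exact-hash matching would misfire on dynamic pages, of course; a real system would compare fuzzier fingerprints and require persistent disagreement over time, as the comment says.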

Also useful perhaps would be to identify domains, and indeed IPs involved, where the toolbars are 'apparently' regularly finding new pages before the spiders do. Used right, that could be a new way to track the Bretts out there, and tie their networks together. ;)
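Spotting that pattern only takes first-seen timestamps per source. A toy sketch of the idea (names and data invented for illustration):

```python
from datetime import datetime

# url -> {"toolbar": first-seen timestamp, "spider": first-seen timestamp}
first_seen = {}

def note(url, source, when):
    """Record the first time a URL shows up via the toolbar or the crawler."""
    first_seen.setdefault(url, {}).setdefault(source, when)

def toolbar_led_urls():
    """URLs the toolbar reported before the spider ever fetched them."""
    hits = []
    for url, ts in first_seen.items():
        if "toolbar" in ts and ("spider" not in ts or ts["toolbar"] < ts["spider"]):
            hits.append(url)
    return hits

note("newsite.example/a", "toolbar", datetime(2006, 5, 1))
note("newsite.example/a", "spider", datetime(2006, 5, 3))
print(toolbar_led_urls())  # ['newsite.example/a']
```

Grouping those URLs by domain and hosting IP is then what would let you tie a network of sites back to one operator, per the comment.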

You don't think the toolbar

You don't think the toolbar users are average enough for that to play a role in some sort of temporal algorithm?

Would this
be mostly explained by exact searches for the domain name from people logged into Google accounts, then?

Good to see ya at TW Ammon :)


Since you ask, I'd contest that the situation at the other end of that link was showing good old 'fresh bonus' effect. However, having a savvy marketing friend involved simply meant that other initiatives kicked in before the fresh wore off. That's how the fresh is ideally supposed to work, of course. It's only the sites that don't actually offer much that disappear again, because all they really had to offer was 'freshness' for a while. :)


Doesn't Hitwise use clickstream data? Don't they buy it directly from the ISPs?

Hitwise do buy raw logs

Hitwise do buy raw logs direct from ISPs and extrapolate their results from there. In AU and UK they have some amazing information. In the US it is great, but not as good as the previously mentioned countries.

wait a minute...

>> Hitwise do buy raw logs direct from ISPs

I would be very interested in buying that kind of information myself (well, actually it's not me, but never mind).

Which ISPs is this - larger/smaller ones? Got any links to more information about that? Official confirmations or so?

What about the privacy policy? Any other countries where this is possible?

It's very simple to know

It's very simple to know they use it.

Look up the term "Roses" in mid-January, and the results talk about the Tournament of Roses.

Look up the same term in February and all the results are about buying roses for Valentine's Day.

Duh. I've known this for years.
