Google Patent - Analysed


I've spent a good deal of time over the past two days analyzing the Google patent.

Now part of me says, "You spent the time doing the research; you should keep it to yourself and not share." The other part of me says two heads are better than one, and someone could point out a flaw or oversight in my research, helping me reach a new conclusion. That part won the argument, so I'll share my thinking. Many people who have seen the document think it's nothing more than a red herring to throw people off; others disagree. I believe the patent expresses three things:

  1. Factors Google thinks are important and may be in the current algorithm
  2. Factors Google thinks are important and wants to incorporate into the algorithm in the next 3-5 years
  3. Factors Google would like to stake an early claim to, so competitors don't use them.

If you read through the document you'll see many very broad, sweeping, and often contradictory statements, causing people to dismiss the document as rubbish. However, I think they are missing the point. What Google is saying is that the actions and behaviors of search engine optimizers mimic those of real websites, but differ in scale, intent, and relationship to other factors. For example, if a website suddenly gains 500 new links in a week, is that good or bad? The answer is: it depends. If the links were for a breaking, hot, or trendy search term, probably not spam; otherwise, probably spam. So if a website had a higher than average growth of inbound links for a particular term, yet there was no corresponding spike in search volume, then it would be reasonable to assume that the growth in links is spam. From the algo's point of view, the relevant anchor text would be given a high score, but would also carry a strong indication of being search engine spam. When you look at the website or document as a whole, if it has a lot of factors that strongly indicate spam, the likelihood of it not being a "natural" occurrence increases, and your website will be filtered out (aka sandboxed).

Think of it this way: let's say you're driving a red Corvette down the street. You won't attract too much attention. Now add in that you're driving 10 miles per hour over the speed limit; still not that big a deal. Now add in a broken tail light; it starts to look more suspicious. Next you've got the convertible top down and the music blasting. Finally, the person in your passenger seat is hanging out of the car, flailing their arms and screaming. You will get pulled over.
Other than your passenger hanging out of the car, none of these offenses would get you pulled over by themselves, but the more of them you combine, the more likely you are a troublemaker.

Here's a list of some of the factors mentioned in the paper. Again, there are perfectly normal reasons for any of these monitored factors to change; the point is that the more warning flags you set off at once, the more likely you are to be spamming the search engines. I've included the sections from which I drew my conclusions for reference.
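The traffic-stop idea above can be sketched as a simple signal-combination score. This is purely a hypothetical illustration of the principle (the function name and the probability-style combination are my own, not anything from the patent):

```python
def spam_likelihood(flags):
    """Combine independent warning flags into one spam likelihood.

    Each flag is a score in [0, 1] saying how suspicious that single
    factor looks in isolation.  No single flag is decisive, but the
    combined likelihood climbs with each one you stack on -- like the
    traffic-stop analogy above.
    """
    not_spam = 1.0
    for f in flags:
        not_spam *= (1.0 - f)  # chance that every flag is innocent
    return 1.0 - not_spam

# One mild flag (speeding) barely registers; several mild flags
# together push the combined likelihood much higher.
print(round(spam_likelihood([0.3]), 3))
print(round(spam_likelihood([0.3, 0.3, 0.3]), 3))
```

The design point is that the combination is multiplicative on the "innocent" side, so individually harmless signals reinforce each other instead of merely adding up.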

Domain Factors

  • Length of domain registration (section 0099)
  • Domains are monitored for changes in expiration (sections 38, 39)
  • Nameserver and Whois data are monitored for changes and for valid physical addresses (same technology used in Google Maps)
  • Name servers and possibly class C networks should have a mix of Whois data, registrars, and keyword and non-keyword domains (section 0101)
  • Documents/websites are given a discovery date when they are discovered through any of the following means
    • external link
    • user-gathered data (sections 1, 2, 3, 4, 38)
  • Websites must have more than one document (section 5)
  • Changes in the weighting of key terms for a domain are monitored (section 50)
  • Changes in a domain to topics that don't match prior content are an indicator of a change of focus; existing prior links will be discounted (section 0084)

Documents and Pages

  • Documents are compared for changes in the following
    • frequency (time frame)
    • amount of change
    • (section 6,7,8, 9, 11, 12)
  • The number of new documents (internal?) linked to a document is recorded (sections 9, 13)
  • Change in the weighting of key terms for the document is recorded (section 10, 14)
  • Documents are given a staleness (lack of change?) rating (section 19)
  • The rate at which the content of a document changes and its anchor text changes are recorded (sections 31, 33)
  • Outbound links to low trust or affiliate websites may be an indicator of low quality (section 0089)
  • Don't change the focus of many documents at once ( section 0128)

Links

  • A link's anchor text and discovery date are recorded (sections 54, 55, 56, 57, 58)
  • Links are given a discovery date and monitored for appearance and disappearance over time (sections 22, 26, 58)
  • Links and anchor text are monitored for growth rates (section 48)
  • Links are monitored for changes in anchor text over a given period of time (sections 27, 30, 54, 55, 56, 57, 58)
  • Links are weighted on trust or authoritativeness of the linking document, as is the newness or longevity of the link (section 28, 58, 0074)
  • Link growth of independent peer documents (different class C networks?) are monitored.
  • The rate at which new links to a document appear or disappear is monitored (sections 23, 24)
  • A freshness rating of new links is recorded (section 32)
  • It is determined whether a document has a trend of appearing or disappearing links (section 25)
  • A distribution rating for the age of all links is recorded (section 29)
  • Links that have a long lifespan are more valuable than links that have a shorter lifespan (section 59)
  • Links from stale pages are devalued, while links from fresh pages are given a boost (section 60)
  • Link churn is monitored and recorded (section 61, 62)
  • New websites are not expected to have a large number of links (section 0038)
  • Link growth should remain constant and slow (section 0069, 0077)
  • Burst link growth may be a strong indicator of search engine spam ( section 0077)
  • If a document is stale (not changed) but is still acquiring new links, it will be considered fresh (section 0075)
  • If a document is stale and has no link growth or has a decrease in inbound links, its outbound links will be discounted (section 0080)
  • A spike in links would be acceptable if the document has one or more links from authority documents (section 0110)
  • Anchor text should be varied as much as possible (sections 0120, 121)
  • The growth of variation in anchor text should remain consistent (section 0120, 0121)
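Several of the link items above (sections 0069, 0077, 0110) describe one mechanism: compare recent link acquisition against a historical baseline, and excuse bursts that have an innocent explanation. A minimal sketch, with hypothetical names and a made-up threshold (the patent gives no numbers):

```python
def link_spike_is_suspicious(weekly_new_links, has_authority_link=False,
                             search_volume_spiked=False, burst_factor=4.0):
    """Flag a burst in inbound-link growth as possible spam.

    weekly_new_links: counts of new links per week, oldest first.
    A burst (latest week far above the historical average) is only
    suspicious if it is not explained by an authority link (section
    0110) or a matching spike in search volume for the term.
    """
    history, latest = weekly_new_links[:-1], weekly_new_links[-1]
    baseline = sum(history) / len(history) if history else 0.0
    burst = latest > burst_factor * max(baseline, 1.0)
    return burst and not (has_authority_link or search_volume_spiked)

# Steady growth then a sudden 500-link week with no matching
# search-volume spike: suspicious.
print(link_spike_is_suspicious([10, 12, 11, 500]))                             # True
print(link_spike_is_suspicious([10, 12, 11, 500], search_volume_spiked=True))  # False
```

This matches the 500-links-in-a-week example in the story text: the same burst is fine when the search term itself is spiking, and suspect otherwise.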

Search Results

  • Volume of searches over time is recorded and monitored for increases (sections 17, 18)
  • Information regarding a document's rankings is recorded and monitored for changes (sections 41, 42, 43)
  • Click-through rates are monitored for changes in seasonality, burst increases, or other traffic spikes (sections 43, 44)
  • Click-through rates are monitored for increasing or decreasing trends (sections 51, 52, 53)
  • Click-through rates are monitored to see whether stale or fresh documents are preferred for a search query (sections 20, 21)
  • Click-through rates for documents for a search term are recorded (sections 15, 16, 37, 43)

User Data

  • Traffic to a document is recorded and monitored for changes (possibly through the toolbar, or desktop searches of cache and history files) (sections 34, 35)
  • User behavior on websites is monitored and recorded for changes (click-through, back button, etc.) (sections 36, 37)
  • User behavior is monitored through bookmarks, cache, favorites, and temp files (possibly through Google Toolbar or Desktop Search) (section 46)
  • Bookmarks and favorites are monitored for both additions and deletions (sections 0114, 0115)
  • User behavior for documents is monitored for trend changes (section 47)
  • The time a user spends on a website may be used to indicate a document's quality or freshness (section 0094)

Miscellaneous

  • Documents that change frequently in ranking may be considered untrustworthy (section 0104)
  • Keywords with little or no change in results should match domains with stable rankings (sections 0105, 0106, 0107)
  • Keywords with high volatility of change should have domains with more volatility (sections 0105, 0106, 0107)

Again, what and how much of this is actually in place is open for debate. If you think I interpreted something incorrectly, please let me know. If you have other ideas, let me know; I'd be glad to add them here.

Comments

 

Very good overview.

I'm still going through it and haven't formed a firm decision on any single point yet, but based on what I've read so far I don't think you're far (if at all) off base.

Thanks :)

Checking domain registrations against Whois

Matt Cutts confirmed in a personal communication to me dated 28 Oct 2002 that Google did not currently check domain registrations against the Whois database.

This leaves us with 3 possibilities:

1. He was blatantly lying at the time and they've been doing it all along.

2. Things have changed since then and they are now doing it.

3. They're not doing it yet but are either planning to do so or are looking into it as a feasible future option.

I guess I'll reserve my personal opinion which of the three I tend to believe ...

Google Patent Analyzed

Over at ThreadWatch graywolf has a great, in-depth look at the Google Patent. I've been waiting for something like this. And while this is large itself, it'll be easier than going through the whole patent. Thanks graywolf. Much appreciated.

The trouble with patents

There's a ton of anti-patent noise out there, but I can't resist, as this patent shows a more subtle but equally glaring weakness of the patent system, as compared to the prior-art issue. There may not be much prior art for this stuff, but it's all utterly obvious in the sense that any one of 100,000 groups of halfway smart software engineers in the world sitting around brainstorming this problem would come up with all or most of these ideas pretty quickly.

That also shows the danger... that such brainstorming in the future will be fraught with patent roadblocks. And this process itself -- brainstorming technical ideas -- is the very lifeblood of software. Bad scene!

Fantom...

I believe Matt also mentioned that they checked the whois data in one of his presentations at the Vegas WMW.

 

At the Orlando WMW, I was with a group in the pub that asked Matt about proxy registrations. He stated quite clearly that they didn't include whois info in any algo calculations, but if a site or network triggered a hand check for other reasons, they would definitely look at the whois.

He said a proxy registration wouldn't cause any penalty on its own (there are valid reasons for using them, after all), but if there was a clear pattern of problematic activities it would likely be viewed with suspicion.

It made good sense to me at the time. Things might have changed since then, though.

---------------
Great post, Graywolf! Thank you for sharing your thoughts.

well done graywolf :)

- that's a very nice list, thanks a lot. It must have taken a few hours to order it like that.

From the "misc" department, the top one is document specific while the two others are domain specific. But then again, all three are also search specific. To sum up:

  • Link factors: 21 items
  • Document factors: 9 items (10)
  • Domain factors: 8 items (10)
  • Search factors: 6 items (9)
  • User factors: 6 items

So, most emphasis is on links. Documents and domains are second. User factors -- while much debated -- are the least important elements in number. I didn't see advertising mentioned specifically in your list? (it would probably be at the bottom anyway)

---

Regarding domains, that's probably the closest you can get to the concept of "a site" without looking at it in person, but it should be noted that "domains" and "sites" are not the same thing, as one domain can hold more than one site, and one site can have content from more than one domain.

---

Your traffic analogy makes good sense to me. Personally, i've recently thought about it (#96) as "rating" vs. "ranking", where the former is some kind of "general signal of quality" (or trustworthiness/reputation/whatever).

All of the above are potential "rating" parameters, and so is the topical stuff (LocalRank, LSI, Hilltop), and perhaps also the good old PageRank (i know this might be seen as controversial by some, ie. PR not used for ranking, only for rating). Added to this, there are the 100 usual factors that are used for ranking.

This is pure speculation at the moment, of course, but it appears natural to me to consider that "the higher your general rating is, the more weight the ranking factors have" - so, a lot of good ranking factors (eg. 1000 IBL's) without a high rating leads to nowhere.
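That "rating gates ranking" speculation can be expressed as a trivially small model. Everything here is hypothetical (the function, the scale, the multiplicative form); it just makes the "leads to nowhere" point concrete:

```python
def serp_score(rating, ranking_signal):
    """Hypothetical model where 'rating' gates 'ranking'.

    rating: general quality/trust signal in [0, 1].
    ranking_signal: conventional ranking score (eg. from links).
    With a low rating, even a huge ranking signal gets you nowhere.
    """
    return rating * ranking_signal

# 1000 IBL's behind a near-zero rating end up scoring below a
# modestly linked page with a high rating.
print(serp_score(0.05, 1000))
print(serp_score(0.9, 200))
```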

---

Added:
The words "leads to nowhere" should be "will lead to nowhere" as it's obviously not the case now (at least not across the board). I think this could very well be the general direction we're moving in and it might take some time to get there for all types of SERP's.

---
Added(II):
I wonder if i could go out and patent the "ranking vs rating" line of thought... I guess the "Fishbein" technique from market research is prior art though. If not, i have already published the above to the public domain *lol*

More..

Wonderful stuff GW, thanks!

I see Nathan is also talking about what it means to webmasters - im not sure i entirely follow, but it's worth a look.

Quote:
Here's one interesting fact: PageRank isn't about the number of links, its about link growth. Sheer volume of links is meaningless, because Google tracks historical link volume data, determining rises and falls in the number of links. If your site earns a steady number of links every month, it may never move up in the rankings, because it is not gaining in popularity. Link building campaigns are one step removed from meaningless, because they can never gain momentum. In a sense, web spam won't help rankings as much as might be thought, because you cannot infinitely increase the rate of spammage, and the moment it drops off, your site is dead.

Emphasis mine.

This does all rather assume that these points in the patent are actually implemented, i assume most folks think "business as usual"?

Business as normal?

Yep - 'cos we all knew that most of this was possible in theory just not if they were doing it - and we still don't.

Still huge questions in my mind - seasonal products naturally get different linkage levels/rates at different times of year, occasionally most sites get an out of character link boost naturally, how do you interpret what they're saying anyway? (that quote above isn't at all how I'd translated it - I think there's as many interpretations as there are people looking at the patent)

My evil twin is considering how easy it may soon be to kybosh a competitor though :)

More

Rand has some detailed thoughts also on the patent.

Im sticking with "it's all BS" though for now :)

BS Patent App

Yup Nick, I am getting feedback from others as well that leads me to believe that this patent app is BS.

Not BS - no way

Because I have clients who handle thousands of domains, I've seen many of these factors in play for quite some time now.

For example, "fresh links" have been in place for about 1 year. For 1 client, we switch the linkage architecture every 4-6 weeks, and sites that are past the sandbox to which we apply this have done very well over time.

Also, why else would they have been developing other tools, such as desktop search? What do you think they're doing with the data they've gathered from the toolbar, that millions of people have happily given to them?

This patent clearly shows how Google can evolve. Everybody knows that current link based algos are flawed. By including all these other factors they tie the hands of SEO's and only allow legitimate websites to thrive.

I'm still working on deconstructing this and understanding which factors are at play and which aren't. However, I can almost guarantee that many factors are at play, and many others are being added as we speak, and more to be added in the next few months.

 

I don't want to sound like I am on the "BS" side right now, but until we see Google clearly tied to this patent we should be cautious, that's all.

Most of the info in this patent has been speculated to be true by MANY online marketing professionals for some time now. Anyone could construct this type of patent application.

hmmm - define BS

I have more 'evidence' that some of this isn't in action than that it is, but I don't keep the right sort of records to really argue one way or another on most of it. I wouldn't say it was all BS though - for starters you can patent something before you use it, so just because to date they haven't done something doesn't mean they won't/can't.

Plus everything on it makes some type of sense and most of it isn't technically that hard - incorporating it into the algo might be tricky but if you isolate all the bits out there's nothing that's rocket science as a stand alone issue.

So how can you dismiss it totally? I agree it's a nice bit of drama and guaranteed to scare some people but it doesn't seem entirely written for effect, does it?

 

>>So how can you dismiss it totally?

I dont in truth, your next sentence is why i tend to lean towards the BS camp

>> I agree it's a nice bit of drama and guaranteed to scare some people

I see 2much's point as clear as day, i dont track like she does, but i listen, and watch a lot of conversations from those in the know and that seems to be a reasonable and widespread 'truth'. I suspect other things may come into play, but mostly i think it's future/possibility stuff....

Proof

Nick, I'll put together some data for you.

 

>> Anyone could construct this type of patent application.

P'rhaps we should all chip in and patent the next batch of ideas we don't want used? I'm thinking it's win win, either they won't be used or we could make some money back to compensate for the lost traffic :)

>>Nick, I'll put together some data for you.

That would be awesome!
