Supplemental Results and docIDs

Since everyone is still trying to explain what is going on with the supplemental index, I thought I'd look up Daniel Brandt's old theory, which Matt Cutts has pooh-poohed on more than one occasion.

He posted it on Threadwatch many moons ago (http://www.threadwatch.org/node/2734), where it got a pretty rough reception from some of the guys who are now TW editors :)

Quote:
Google executives also were asked about innovating in server architecture in the future, given that one of the company's biggest rivals, Microsoft, is developing search tools on a 64-bit architecture. Google currently runs its search service on a 32-bit architecture. Search experts say that platform may allow for advancements such as better personalization. Google co-founder Sergey Brin downplayed the importance of the underlying architecture. 'I do not expect that the particular choice of server architecture is going to be a deciding factor in the success of our service,' he said.

My theory is that Google has a round robin of three indexes. Something like the main index, the URL-only listings, and the Supplemental Index. They have a rotation system that makes room in the main index on a regular basis. We know that they compute PageRank continuously these days (GoogleGuy just said so), but since they're still using a 32-bit system, that means that PageRank maxes out at 4.2 billion docIDs. It's impossible to do a PageRank calculation unless each page has a unique ID, and since Google is still stuck on a 32-bit system, this leaves pages that cannot be included in this calculation.

This helps explain the instability of Google's results with each new update. Some sites get into the main index, but only if other sites fall out. The results you see are a blend of various indexes and anti-spam filters. But the main reason for the general instability and deterioration of results is the limit on docIDs caused by the limitations of the 32-bit architecture. Sure, you could hack all the software to make two grabs and put together a docID that exceeds this limit, but it is easier to play games with rotation of various indexes than it is to hack the software. Besides, making two grabs for the docID is CPU-intensive, as the docID is called up constantly in real time. It's probably the single most pervasive piece of information in the entire Google system.
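
For anyone who wants to sanity-check the arithmetic Brandt is leaning on, here's a quick Python sketch of the 32-bit ceiling and the "two grabs" trick he mentions. It's purely illustrative and has nothing to do with how Google actually stores docIDs; the names and values are made up for the example.

    # Back-of-the-envelope arithmetic behind the 32-bit docID claim.
    # Purely illustrative -- not Google's real data structures.

    MAX_32BIT_IDS = 2 ** 32   # 4,294,967,296 -- just under 4.3 billion (the quote rounds it down to 4.2)
    print(f"Max unique 32-bit docIDs: {MAX_32BIT_IDS:,}")

    def combine_docid(high_word: int, low_word: int) -> int:
        """The 'two grabs' idea: stitch two 32-bit words into one 64-bit docID."""
        return (high_word << 32) | low_word

    # Every lookup now has to fetch and assemble both halves, which is the
    # extra per-access cost Brandt argues Google would rather avoid.
    doc_id = combine_docid(0x00000001, 0xDEADBEEF)
    print(f"Combined 64-bit docID: {doc_id:#018x}")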

It really makes sense to me. I mean:

- Why have two or more indexes? Doesn't that make PR calculations harder, or even impossible? Doesn't it also mean extra queries to deliver a single set of SERPs? And isn't it slower?

- At one point nearly all of Google's results were supplemental. Perhaps the main index was out of use?

- Even now Google pushes the majority of pages into the supplemental index. With all the DB-generated spam out there, they'd need a huge index to store all this poo in.

- Matt Cutts recently revised his wording of what goes into the supplemental index to include pages that are less important, not just background pages used to augment difficult queries.

All this seems to me to indicate that the main index is very constricted and the Supplemental index is much bigger.
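
To make that "constricted main index, much bigger supplemental index" picture concrete, here's a toy Python sketch of how a two-tier lookup might behave: serve from a small main index first, then pad out with a far larger supplemental one. The index contents and URLs are entirely invented; this is my own guess at the shape of such a system, not anything Google has confirmed.

    # Toy two-tier lookup: a small, capped "main" index plus a much larger
    # "supplemental" index used only as a fallback. Hypothetical data throughout.

    from typing import Dict, List

    main_index: Dict[str, List[str]] = {
        "widgets": ["example.com/widgets", "shop.example/widgets"],
    }
    supplemental_index: Dict[str, List[str]] = {
        "widgets": ["old-blog.example/widgets-2003", "forum.example/t/12345"],
        "obscure query": ["deep.example/archive/page9999"],
    }

    def search(query: str, want: int = 10) -> List[str]:
        """Prefer main-index results; pad with supplemental pages only if needed."""
        results = list(main_index.get(query, []))
        if len(results) < want:
            results += supplemental_index.get(query, [])
        return results[:want]

    print(search("widgets"))        # main results first, supplemental pages fill the gaps
    print(search("obscure query"))  # only the supplemental index knows anything about this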

If that's so, does it matter whether your pages are supplemental? Does it tell you anything except that the page is less important than one in the main index? And how do you tell the difference between a good and a bad supplemental page?

One thing bothers me though - if this is true, why haven't they switched off the main index altogether?

*Note: Google doesn't bother indexing that old TW page. Not even in the supplemental index :))

Comments

I still think it's probably

I still think it's probably poo, but in the interests of democracy and discussion .... :)

Ya know

The last time this poo went 'round, I called a math geek to make sure it was poo. He said, 'That's poo.' I trust the math geek. I trust his colleagues who agreed with him. I trust him even more because he laughed when I read Brandt's theory to him. Then he asked whether this Daniel is a reporter, because he sure isn't an IR guy...

I've now said that the "4

I've now said that the "4 byte docid" theory is wrong multiple times, e.g.
http://blog.outer-court.com/archive/2005-11-17-n52.html
and specifically the quote can be found here:
http://www.mattcutts.com/blog/debunking-google-in-bed-with-cia/#comment-91063

I'm happy to re-say that our docids are not 32 bits.
