Yahoo! Telling Porky Pies About Index? Google thinks so.
Source Title:
In This Battle, Size Does Matter: Google Responds to Yahoo Index Claims
Source Url:
http://battellemedia.com/archives/001790.php
Story Text:
This is kind of funny, as I was talking to a chap at Google who "joked" that maybe Yahoo! had just counted all the URLs in their DB before de-duping them. Now I see John's been talking to GOOG and they're "officially baffled".
I spent an hour or so on the phone with a group of Google folks, and they shared a lot of information about how they measure index size, how they deal with issues of duplicate URLs and documents, and why they are baffled by Yahoo's claim.
[...]
"Our scientists are not seeing the increase claimed in the Yahoo! index. The data we have doesn't support the 19.2 (billion page) claim and we're confused by that."
I've got to say that I find 20bn a hard figure to swallow also, but Google's comments do strike a certain "sour grapes" chord at the same time.
Now, the question is, are Yahoo! stuffing socks down their trousers, or is it really a whopper?

Searching Y.com for "the"
Searching Y.com for "the" produces 11 billion results. Of course, there are non-English pages to consider, images and other non-linguistic objects.... but are there really another 8 billion of those?
Is there a group of Japanese webmasters with massive recursive dynamic sites clogging up Y crawl control? Or have Y started to crack the "deep web" problem, and genuinely stolen a march on Google?
It's over a year old but is
It's over a year old but is linked to from the comments on John's blog and is still a damn good read: the Technology Review article called "Google and Akamai: Cult of Secrecy vs. Kingdom of Openness".
Could the new version be called "Google, Akamai and Yahoo: Cult of Secrecy" ?
Personally I don't care how big the index is, but I do believe it matters to Wall Street and Joe Public, where bigger is generally perceived as better.
Fact is...
...that no-one can be arsed to count them all. So it's a cool claim. Maybe not accurate, but plausible. I don't really buy it, but what the hey... ;-]
the calculation is simple...
Spider all domains, multiply result by 1.8 for sites that don't redirect non-WWW to WWW, multiply that result by 2 to account for all the sites that allow strings like www.example.com?foo, have the PR department intern slip a decimal point and voila, 19.2 billion docs.
Seriously, I am amazed that Y is claiming that number. I have a site that has been online since March 26th, and it has 4400 pages indexed by G, 1005 indexed by MSN, and only the home page indexed in Y. Wouldn't they have been hammering any and all sites that they came across for the past several months in order to reach this volume? Yet there has certainly not been any increase in spidering that would have foreshadowed this announcement.
Remember the month or two before G announced their new index? Everyone's logs were hammered with gBot tracks.
Whenever I do a backlink
Whenever I do a backlink search on Yahoo it always returns many more results than Google. I guess that proves Yahoo is bigger ;) hehehe
About 15 times bigger...
...LOL
Subscription Content
They wouldn't be counting their new Subscription Content, too, would they?
Would they!?!
Comparing raw hits to raw hits
NOTE: I am not using the quote marks shown below in the queries. These links go to FIND ALL queries, not EXACT FIND queries. I do not compete for any of these terms on any of the Web sites that I control or assist people with. These queries are, from my point of view, random.
Yahoo! for "real estate" (597,000,000):
http://search.yahoo.com/search?p=real+estate&fr=FP-tab-web-t-291&toggle=1&cop=&ei=UTF-8
Google for "real estate" (110,000,000):
http://www.google.com/search?hl=en&q=real+estate
Yahoo! for "travel" (1,410,000,000):
http://search.yahoo.com/search?p=travel&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "travel" (400,000,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=travel
Yahoo! for "britney spears" (69,000,000):
http://search.yahoo.com/search?p=britney+spears&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "britney spears" (4,600,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=britney+spears
Yahoo! for "news" (4,120,000,000):
http://search.yahoo.com/search?p=news&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "news" (1,670,000,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=news
Yahoo! for "university" (983,000,000):
http://search.yahoo.com/search?p=university&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "university" (855,000,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=university&btnG=Search
(NOTE: Harvard is now beating out Stanford on that search. I don't know when that happened.)
Yahoo! for "napoleon bonaparte" (2,610,000):
http://search.yahoo.com/search?p=napoleon+bonaparte&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "napoleon bonaparte" (685,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=napoleon+bonaparte
Yahoo! for "care of elephants" (3,050,000):
http://search.yahoo.com/search?p=care+of+elephants&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "care of elephants" (698,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=care+of+elephants
Yahoo! for "specimen" (16,300,000):
http://search.yahoo.com/search?p=specimen&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "specimen" (6,050,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=specimen
Yahoo! for "brazen hussy" (63,000):
http://search.yahoo.com/search?p=brazen+hussy&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "brazen hussy" (17,800):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=brazen+hussy
Yahoo! for "horticultural exchange" (708,000):
http://search.yahoo.com/search?p=horticultural+exchange&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "horticultural exchange" (346,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=horticultural+exchange
Yahoo! for "experimental design change" (8,360,000):
http://search.yahoo.com/search?p=experimental+design+change&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "experimental design change" (10,100,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=experimental+design+change
Yahoo! for "corporate headquarters" (18,800,000):
http://search.yahoo.com/search?p=corporate+headquarters&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "corporate headquarters" (12,000,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=corporate+headquarters
Yahoo! for "kiddie rides" (586,000):
http://search.yahoo.com/search?p=kiddie+rides&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "kiddie rides" (142,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=kiddie+rides
Yahoo! for "spontaneous combustion" (933,000):
http://search.yahoo.com/search?p=spontaneous+combustion&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "spontaneous combustion" (409,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=spontaneous+combustion
Yahoo! for "course curriculum" (36,400,000):
http://search.yahoo.com/search?p=course+curriculum&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "course curriculum" (28,100,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=course+curriculum
Yahoo! for "our wedding" (250,000,000):
http://search.yahoo.com/search?p=our+wedding&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "our wedding" (25,600,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=our+wedding
Yahoo! for "i wrote this song" (32,000,000):
http://search.yahoo.com/search?p=I+wrote+this+song&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "i wrote this song" (7,470,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=i+wrote+this+song
Yahoo! for "dog and pony show" (3,210,000):
http://search.yahoo.com/search?p=dog+and+pony+show&ei=UTF-8&fr=FP-tab-web-t-291&fl=0&x=wrt
Google for "dog and pony show" (802,000):
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&c2coff=1&q=dog+and+pony+show
You make the call.
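For anyone who would rather see the comparison as ratios than as raw numbers, here is a rough sketch that divides the Yahoo! estimate by the Google estimate for each query listed above. The figures are copied from this post; the variable names are made up for illustration. Both engines report estimated match counts, so this only compares the estimates and says nothing certain about actual index size.

```python
from statistics import median

# (Yahoo! estimate, Google estimate) exactly as quoted in the post above.
hits = {
    "real estate": (597_000_000, 110_000_000),
    "travel": (1_410_000_000, 400_000_000),
    "britney spears": (69_000_000, 4_600_000),
    "news": (4_120_000_000, 1_670_000_000),
    "university": (983_000_000, 855_000_000),
    "napoleon bonaparte": (2_610_000, 685_000),
    "care of elephants": (3_050_000, 698_000),
    "specimen": (16_300_000, 6_050_000),
    "brazen hussy": (63_000, 17_800),
    "horticultural exchange": (708_000, 346_000),
    "experimental design change": (8_360_000, 10_100_000),
    "corporate headquarters": (18_800_000, 12_000_000),
    "kiddie rides": (586_000, 142_000),
    "spontaneous combustion": (933_000, 409_000),
    "course curriculum": (36_400_000, 28_100_000),
    "our wedding": (250_000_000, 25_600_000),
    "i wrote this song": (32_000_000, 7_470_000),
    "dog and pony show": (3_210_000, 802_000),
}

# Ratio above 1 means Yahoo! reported more matches than Google.
ratios = {query: y / g for query, (y, g) in hits.items()}
yahoo_higher = sum(1 for r in ratios.values() if r > 1)
top_query = max(ratios, key=ratios.get)

# Print the queries from the largest ratio to the smallest.
for query, r in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{query:28s} {r:5.1f}x")
print(f"median ratio: {median(ratios.values()):.1f}x")
print(f"Yahoo! higher on {yahoo_higher} of {len(ratios)} queries")
```

The spread between queries is the interesting part: if the ratios were driven purely by index size, you would expect them to be roughly constant.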
Yahoo makes up URLs for some
Yahoo makes up URLs for some of my domains:
.com/fd
.com/cart
.com/list
and more
None of these exist, ever have, or ever will. Since I have a mod_rewrite rule going, any file or folder that doesn't exist automatically shows the site map. So if you typed in .com/seomike-rules/ you'd get a page. As I discover these made-up URLs I add rules to trigger 404s, yet they still exist in the Yahoo index... odd.
I think Yahoo is way off on its index count.
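A catch-all rewrite like the one described here is often called a "soft 404": the server answers 200 for paths that do not exist, so a crawler has no signal that the URL is bogus. A minimal sketch of how one might probe for that, assuming a hypothetical `fetch_status` callable that returns the HTTP status code for a URL (it is passed in rather than hard-coded, so no particular HTTP library is implied):

```python
import uuid

def looks_like_soft_404(base_url, fetch_status):
    """Return True if a random, almost certainly nonexistent path answers 200.

    fetch_status is a hypothetical callable: URL string in, HTTP status code out.
    """
    # A random hex path is very unlikely to be a real page on the site.
    probe = f"{base_url.rstrip('/')}/{uuid.uuid4().hex}/"
    return fetch_status(probe) == 200

# Stub fetchers standing in for a real HTTP client.
def catch_all_site(url):
    return 200  # every path shows the site map

def strict_site(url):
    return 404  # unknown paths are rejected

print(looks_like_soft_404("http://example.com", catch_all_site))
print(looks_like_soft_404("http://example.com", strict_site))
```

Whether Yahoo's crawler does anything like this is not known from this thread; the sketch only shows why a site that answers 200 for everything can end up with made-up URLs in an index.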
Maybe Google's not counting
Maybe Google's not counting all those sandboxed websites (ducks and runs for cover)
Estimate
The cool thing is that Y only has about 60% of all my pages.
So, if they ever get to the rest, and my situation is representative, their number could easily climb to 32 billion. Whoa.
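The extrapolation above is a single division: if the claimed 19.2 billion pages are only about 60% of what is out there, the full figure is 19.2 / 0.6. A tiny sketch, where the function name is made up for illustration and the 60% is this one poster's coverage, not a measured average:

```python
def projected_index_size(claimed_billion, coverage_fraction):
    """Scale a claimed index size up to 100% coverage."""
    if not 0 < coverage_fraction <= 1:
        raise ValueError("coverage_fraction must be in (0, 1]")
    return claimed_billion / coverage_fraction

# 19.2 billion claimed, roughly 60% of this poster's pages indexed.
projected = projected_index_size(19.2, 0.6)
print(f"about {projected:.0f} billion")
```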
Garbage
MM's queries just prove that
1. Google doesn't index garbage.
2. Both engines guess the number of matches.
3. The real answer is 42.
It is not possible to estimate the size of a SE index by querying anything. Period.
Although I doubt it, Yahoo may have crawled 20 billion pages in infinite loops on session IDs and other unproductive cycles. Probably every fetched 'page' (plus all embedded objects) got a UUID assigned. Then counting the UUIDs gives that useless number. I bet that only a fraction of those crawled pages made it into the index. Google seems to count indexed pages which can appear in searches, but their published number of searchable pages isn't accurate.
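To make the point concrete, here is a rough sketch of how counting fetched URLs differs from counting distinct pages once www/non-www variants and session IDs are folded together. It is only an illustration: the parameter names in `SESSION_PARAMS` are assumptions, and nothing here is claimed to be how Yahoo or Google actually de-dupe.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session-style parameter names; a real crawler would use its own list.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def canonicalize(url):
    """Fold common duplicate forms of a URL into one canonical string."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www and non-www as the same site
    # Drop session-style parameters and sort the rest for a stable order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(sorted(query)), ""))

def count_raw_and_unique(urls):
    """Return (number of fetched URLs, number of distinct canonical URLs)."""
    return len(urls), len({canonicalize(u) for u in urls})

crawled = [
    "http://www.example.com/",
    "http://example.com/",
    "http://example.com/?sid=abc",
    "http://example.com/?sid=def",
    "http://example.com/page?id=1",
]
raw, unique = count_raw_and_unique(crawled)
print(f"{raw} URLs fetched, {unique} distinct pages after canonicalizing")
```

Real de-duplication also has to compare page content, since two different URLs can serve the same document; this sketch only looks at the URL string.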
Easy
It's easy, just count each post in all the darn "Google Dance" threads ever made as a page and yer at 15 bil. easy. That leaves 5 bil. for all the rest of the internet people use.
Look at the 2nd result in
Look at the 2nd result in this query
site:www.mattcutts.com
Sergey says it's cobblers
An article in the NYT quotes the great man. Funny thing is G will prostitute themselves talking to the NYT!
Nick, Google has it and
Nick, Google has it, and www.mattcutts.com/blog/feed/atom/ is the one.
DaveN