How to Get 5 Billion Pages Indexed in Less Than 30 Days

78 comments

Seeing how this has hit the front page of Digg, I'd say we're well past the stage of 'outing' someone's spam. Looking up the whois data, we see the domain was registered on May 31st, 2006.

Throw in some blog comment spamming, some 302 redirects and some uniqification, and in less than 4 weeks you too can get over 5 billion pages indexed!

You can get some more details at Digital Point and Merged.ca.

I have to say, overall I'm not impressed with anything I'm seeing out of Big Daddy at all. I know Google stopped publicly counting pages, but 5 billion pages has to represent an absurdly large portion of all the pages in the index for one site to have. Look at Apple.com, which has been around for years and has a mere 71 million pages; Microsoft has 172 million. For a site less than 4 weeks old to get 5 billion pages, one would think that many monkeys with that many typewriters would have set off at least one or two warning flags on the way in ...

Comments

Yeah .. It's astonishing

Yeah .. It's astonishing ..

He based himself on subdomains.. Because no single domain/subdomain can hold more than 300-400 million pages.

He/She has literally millions of subdomains with a few pages each ...

Fucking spammers.

After Hoff mentioned the story and after some of us wrote about it too, he removed the Adsense ads from the domain itself (t1ps2see.com, which is the landing domain, using redirects, for all the indexed pages on eiqz2q.org) and moved them to a deeper level.

Not quite what it seems

Hey there,

Regarding the "5 billion indexed"...

We have noticed that some site: queries are showing bizarre results and it's turned out to be tied to a bad data push. We're fixing it now.

Adam, I have to say this

Adam, I have to say this (which I said on my blog too) :

I can’t even imagine how the heck Adsense doesn’t have some automatic filters or sandbox.

How can one publisher generate billions of ad impressions with a 3-week-old website and not ring a bell or raise a red blip on the Adsense team’s radar?

And it doesn’t matter if she (the spammer) moved the Adsense to a deeper level. The other domain is hers too.

clarification

Just to be clear, you're saying that this domain, this domain, this domain and this domain, which were all registered on the same day to the same person and are less than 30 days old, managing to get 63 million pages, 230 million pages, 29 million pages, and 63 million pages is a "bad data push" and is not a "grand slam spammer" who turned his software up to 11?

is not a "grand slam

is not a "grand slam spammer" who turned his software up to 11 ?

ROTFLMAO

I think that's what he's saying. :D

No

I'm saying that the results counts are drastically off. Sorry for not being more clear.

Also, Expertu, I'm not on the AdSense team, so I can't offer any substantive details on this specific situation re: AdSense TOC and earnings.

With that said, though, I'd hope TW'ers would realize that just because people click on an AdSense ad, it does not necessarily mean that an advertiser is charged and a publisher is paid.

it does not necessarily mean

it does not necessarily mean that an advertiser is charged and a publisher is paid.

But it could just as easily be that. You can't deny this right ?

About the 5 billion indexed pages for one domain and several hundred million for the others (owned by the same entity), it's not possible man.. It's off the charts. I guess no other domain in the world has these numbers. It's unbelievable.

And in my opinion it's not a matter of "bad data push". It's just a matter of millions of subdomains, with a few hundred pages each.

Anyway, I'm wondering why it took you guys so long to discover this (I mean, you discovered it when we discussed it here at TW ?)

Don't you have internal statistics like : "Top domains with the most indexed pages" ?

This spammer would've been first on your list.

Sorry if I'm charging at you in such a manner, but a more faithful explanation than "bad data push" would've been great (especially since we're not some local newspaper editors).

Am I crazy for saying that ? Maybe.

Bad Data Push

You know what's also "funny" is how that "bad data push" looks completely different, and there's no 302 redirect, when you change your user agent to Yahoo/slurp (screen shot).

Nice find man.

Nice find man.

No worries, Expertu

I don't post on TW expecting hugs and kisses and I don't take stuff here personally. :)

The incorrect results-count estimation (e.g., that we have billions of these pages in our index) was caused by a bad data push.

I'm not the person to debate with about AdSense stuff.

Anyway, I'm gonna call it a night (at least geekwise) for now. I'll check back here on Monday.

Bug, yah, right!!!

http://www.alexa.com/data/details/traffic_details?q=t1ps2see.com&url=t1ps2see.com

Explain how they got to ranking 1,894 in Alexa, in just a week or two!!! It's because Google aka Spamoogle, is messed up and now loves spammers.

BigDaddy should be called CrappyDaddy.

Right!

And anyway in April 2005, you had :

©2005 Google - Searching 8,058,044,651 web pages

I guess this website is more than half of Google's index :)

PS: Why did you guys remove that "searching X pages" note from the footer ? (tricky question)

I'm gonna call it a night (at least geekwise) for now

Euh, shouldn't you be fixing this problem instead of calling it a night ?

Where's the problem?

It's a website. A large, automated one. Big deal.

Maybe one day we'll all use it for search, instead of Google ;)

Maybe one day we'll all use

Maybe one day we'll all use it for search, instead of Google ;)

It's 60% of Google's index. One more week and it's as big as Google :)

Bad data push and incorrect

Bad data push and incorrect results-count? Yeah, ok. And so it showing up in the top 10 RANKINGS for all kinds of things (including showing up at least 4 times in the top 10 for some phrases) is just a mis-count too, eh? I'd suggest that whichever tech told you it was a site: command problem should be called back into the office for a longer discussion.

should be called back into

should be called back into the slaughterhouse for a longer discussion.

And I think that the boost in the SERPs is because the domain is new, and all of them get that ..

Check out my 2 months old domain (webotopia.org is mine there) : People Society Crime.

you havent seen the best bit yet....

I can't wait for them to fuck up GPay like this :)

"Oops, it was a 'bad money pull' that caused us to debit $100 from 1 billion people's credit cards".

Can you really trust Google with your money? Your business? Your data?

I think we're all too

I think we're all too pessimistic.

It's just a small problem they have with a certain "bad data push". They're fixing it now. .. That's all man ..

ROTFLMAO.

"Oops, it was a 'bad money pull'

as opposed to 5 billion SpamSense pages which is a

'good money pull'

someone probably got a pat on the back.

MSN got fooled with 62 pages

Indexing is nothing

Indexing is nothing.

Ranking is everything.

*me goes back to building my websites one page at a time*

This is really poor

This is really poor on Google's part and the explanation of a "bad data push" is also very poor.

Legitimate sites which have been de-indexed as of late could have lost pages as a direct result of this site. This is worthy of the mainstream media picking this up and I hope they do.

For all the geniuses at Google, they really do a lot of simple things wrong. How about something easy like: When a site gets over 100 million pages (sub-domains included), we have someone hand review the site.
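A rule like the one proposed above is simple to express in code. What follows is only a sketch in Python under loud assumptions: the per-hostname page counts are assumed to be available as a plain dictionary (the thread says nothing about how Google stores them), `registered_domain` and `sites_needing_review` are hypothetical names, and the domain extraction naively keeps the last two labels, which is wrong for suffixes such as .co.uk.

```python
from collections import defaultdict

# The 100 million figure is the one proposed in the comment above.
REVIEW_THRESHOLD = 100_000_000


def registered_domain(host):
    """Naive assumption: the registered domain is the last two labels."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def sites_needing_review(page_counts, threshold=REVIEW_THRESHOLD):
    """Sum indexed pages per registered domain, subdomains included,
    and return the domains whose total exceeds the threshold."""
    totals = defaultdict(int)
    for host, pages in page_counts.items():
        totals[registered_domain(host)] += pages
    return sorted(domain for domain, total in totals.items() if total > threshold)


# Made-up counts for illustration only; these are not real index figures.
example_counts = {
    "a.eiqz2q.org": 60_000_000,
    "b.eiqz2q.org": 70_000_000,
    "www.example.com": 5_000,
}
flagged = sites_needing_review(example_counts)  # ["eiqz2q.org"]
```

Neither subdomain crosses the threshold alone; only the per-domain sum does, which is the point of counting subdomains together.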

This site is receiving huge traffic, many top tens and Alexa rankings prove that. Maybe site: is not working properly (it's off by a billion pages or so) but this is a massive site to have a top 2,000 Alexa ranking in 4 weeks.

This is also a strange statement:

just because people click on an AdSense ad, it does not necessarily mean that an advertiser is charged and a publisher is paid.

Okay, you mean if click-fraud is detected but what does that have to do with real users and legitimate clicks on spam pages you let in your index.

Yeah, a bad data push that involves billions of pages going to one site while other sites lose pages to make room.

Adam, the webmaster and SEO

Adam, the webmaster and SEO community have been talking about very inflated results numbers for quite a few months.

Are you implying that Google just now noticed it?

MSN got fooled with 62

MSN got fooled with 62 pages

Hardball, did you have a look at any of those pages with any other user-agent, except Google ?

Like Slurp or MSN. Go ahead.. Have a look.. they don't redirect to the http://t1ps2see.com domain .. They just show ads.
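The check described above (request the same URL while identifying as different crawlers, then compare what comes back) can be sketched roughly as follows. This is an illustration, not anyone's actual tool: `compare_user_agents` and `looks_cloaked` are hypothetical names, the user-agent strings are illustrative, and the network call is left to a caller-supplied `fetch` function so the comparison logic can be shown without making real requests. The stub fetcher below is invented for the example, including which agent gets the redirect, and does not reproduce real observations.

```python
# Illustrative crawler user-agent strings (assumptions for this sketch).
USER_AGENTS = {
    "googlebot": "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "slurp": "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "msnbot": "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
}


def compare_user_agents(url, fetch, agents=USER_AGENTS):
    """Call fetch(url, user_agent) once per agent.

    fetch is supplied by the caller and must return a tuple
    (status_code, location_header_or_None).
    """
    return {name: fetch(url, ua) for name, ua in agents.items()}


def looks_cloaked(results):
    """True when the agents did not all receive the same response."""
    return len(set(results.values())) > 1


def stub_fetch(url, user_agent):
    """Stand-in fetcher: one agent is redirected, the others are not."""
    if "Googlebot" in user_agent:
        return (302, "http://t1ps2see.com/")
    return (200, None)


results = compare_user_agents("http://eiqz2q.org/", stub_fetch)
cloaked = looks_cloaked(results)  # True for this stub
```

In practice `fetch` would wrap an HTTP client with redirects disabled, so that the 302 itself is visible rather than the page it leads to.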

ranking ain't everything

>Ranking is everything

Nah traffic is everything and looking at those alexa stats I'd say he's not only got rankings but traffic too ...

got rankings

and a nice search term/kw list too.

looks like 10-20% is hitting for some decent stuff, scraping the page titles would generate some goodies.

qwerty

[edit]This poster is an uneducated moron, posting long lists of porn sites [/edit]

This is awesome... my favorite result

67 of the top 100 rankings for "pizza sauce recipe" went to one domain:

http://superaff.com/downloads/googlepizzasaucerecipespam.pdf

hilarious

... surely some members here have pages that are similar to these.. or what?

YO!

I don't :)

What's my Beef

I got no problem whatsoever with autogen content.

What I have a problem with is the "pay no attention to the man behind the curtain" responses to the issues with "big daddy". Indexing and ranking billions of pages is more than an "oops" type of error. I'm seeing all sorts of indexing goofs, content being attributed to the wrong domain, pages being dropped, sites not listing under searches for their domain URL, and a laundry list of other items.

We keep hearing that big daddy is more efficient with a new crawling architecture that uses the adsense robot blah blah blah.

I can honestly say I'd rather have less efficient crawling from only one bot with less screw ups any day of the week.

Give the man some credit ....

Getting that many pages indexed that quickly must have really taken some doing. And using Googlebot's bandwidth like there is no tomorrow.

Avoiding the 'sandbox' too.

Be Nice

At least Adam posts, we don't want him to go MIA like Matt on permanent vacation, it's nice that we hear anything from Google. Heck, no posts from Google Guy on WMW since May 8th, so a little information is nice to see someone is still awake at the 'plex.

the explanation of a "bad data push" is also very poor

If they told you technically what went wrong you probably wouldn't understand it anyway.

My wife works for a different search company and that's what I hear them call it too, a "bad push", and that term refers to both rolling out new software and/or data that fails.

GoogleBot Monitor

Now, lemme see....

GoogleBot Monitor Console

Spidering 1 of about 17 795 462 domains:

Accessing site: 1 (eiqz2q.org)...

Spidering page number: 5 145 473 298 of about ?????? pages.

Scanning...

* * * Disk Write Error * * *

Hard Drive Full: 250 000 GB used (0 free)
Abort, Retry, Fail?

Bwhahahaha!

>If they told you

If they told you technically what went wrong you probably wouldn't understand it anyway.

Nobody is asking for three-page technical explanation, but don't just brush it off as site: is not working either. Alexa rankings under 2,000 in 4 weeks along with the fact that this site is ranking in searches points to a large problem. Fess up, no smoke and mirrors or PR spin please.

Openness and honesty are all that is being asked. Did this site having billions of pages indexed affect other sites being de-indexed? Could someone at Google please answer that honestly?

I don't know...

I don't know, most of my sites are doing very well maybe even better than well in a time when others are complaining about being dropped.

I spent a few weeks in google sitemaps group and 90% of the time people's complaints stem from things like:

1 - Canonicals and failure to set a 301 for new sites (don’t laugh it really works for those with few backlinks - google needs to work on this still.)

2 - Link Building - Connecting synapses with those who have sinned, it was smart not to "link build" on the sites I care about, looks like the payback for years without success is coming.

3 - Content that is filled with air, you know, the old "content is king" let's have our writers write about stuff they are not passionate about and pay them? (A damn funny thing to watch sorry)

4 - COMPETITION - There are a lot of really smart people who know nothing about "SEO" getting in the game and doing well, Google loves these folks.

Agree with any of this?

Bad data push?

Quote:
We have noticed that some site: queries are showing bizarre results

Might have a look at the link: command while you're fixing that too. I'm sure you guys have no problem getting feedback from folks, but you would probably get better more accurate feedback from ALL the people that watch if you tried to give better more accurate tools and results (rather than misinformation and spin).

I think I am obligated about once a week to remind folks of the transparency with yahoo and their increasingly better results compared to the increasingly shrouded veil of secrecy of google, and their increasing problems with scalability and data integrity. Results might even be good, but what happens to credibility when you start feeding misinformation through all the tools that we use? Why not get rid of TBPR rather than feed bullsh*t numbers from it that mean basically nothing? Why have a link command that is pure tripe?

Man...it's not even FUN to be a googlebasher anymore. I still LIKE google, but sheesh...bad data push? This isn't slashdot Adam. These are people that watch your results as closely as you do. The counts may be off...but they're ALWAYS off for whatever reason, and this guy simply beat the system, and it's making G look bad.

“Always tell the truth - it's the easiest thing to remember”

David Mamet

G got away from this and tried outsmarting webmasters and seo's at some point. You can fool most people most of the time, but you won't fool ALL the people all the time.

Whoever this guy is, he did a good job using the loopholes...and that's not even to say I agree with his methods or techniques (I don't really)...but he beat your system by studying it better than you thought anyone would...the only way it was caught was with the help of others. Those same others that G likes to misinform because they might be of detriment to relevancy. I apologize Adam, for laying this out on you like you're Sergey or Larry or something...I do appreciate you being here and communicating and contributing...what you and Matt represent are the areas of G that I appreciate and hold as ideals...but damn...when did Google become M$ and throw all the idealism and cluetrain stuff out the window for corporate politics, pointless meetings, and propaganda?

Start by bringing back a real link command and you'll make a lot of people happy and reverse some of the negative sentiment that has been rumbling within webmaster communities and getting bigger by the day.

Quote:
I can honestly say I'd rather have less efficient crawling from only one bot with less screw ups any day of the week.

Seconded. Please add to the list making the robots mind standards rather than making their own as well. I KNOW there are smart people at G that can code a bot that minds meta tags, so how the hell does stuff like this slip past a QA test and get live?

Hmm, just how big is a Google Datacentre?

How many people pulling that same stunt would it take to fill every hard drive that Google owns, and how far are they off full capacity anyway?

Google indexing 5 000 000 000 pages from one site surely blows away any notion that currently they are running at maximum capacity (though they might have been just a few months ago - we'll never know).

I agree with the "bad data push" excuse. Yeah, the "bad data" was the 5 billion pages, and the "bad push" was the fact that any of it made it into the index at all.

Wrong

Those sites aren't actually in the Google index, you're receiving a bad data push to your browser.

Gah!

Okay this one made me spit out my coffee damn it.

Hehe....

...some quick manual action seems to have taken place.

MC found an internet café - perhaps?

SpamSense

Adam, please wait for few more weeks. You will have world record of trillions of webpage indexed. By now every spammer will copy the technique.

De-Indexing Problems in this "bad push" as well maybe?

It would seem that a de-indexing problem I faced in May (which did seem to get resolved fairly quickly), has come back over the weekend. Has there been any mention of de-indexing involved with this "bad data push"? The pages are recognized still with PR yet not showing any results in the "site:" search or when looking up the exact page. I get "Sorry, no information is available for the URL my-site.com/blablabla.html" Same as back in May. *clenches teeth*

Just us or are others seeing something like this amongst the weekend excitement? Or is this a completely different issue?

site: has been broken for a long time

Quote:
Abort, Retry, Fail?

*laughs* :)

I wonder how many pages they really have/had in the index? I know I have sites where I know the exact page count - impossible to get additional 200 responses - and google claims to have 6 or more times as many pages indexed (it changes, swings up and down, day to day).
Site: is broken and has been for a long time....

Bad data push ?

Maybe that is the whole point of Matt Cutts' blog: to feed us bullsh*t and bait, then just sit back and wait for surfers to reply to his posts so they can get an understanding of what's going on. http://www.baddatapush.com

Those sites aren't actually

Those sites aren't actually in the Google index, you're receiving a bad data push to your browser.

MC found an internet café - perhaps?

LOL .. You people are crazy :)) LMAO

All I want to say is : I want my Gooooogle from 9 months ago.

It's not completely fixed!!

Yo, Adam, now that you banned all those domains, fix the site: bug that shows up when searching quality content sites, where you only get some, if any of the pages.

Some of the datacenters look fixed, while google.com is in the toilet.

I am also currently seeing

I am also currently seeing some more stable results with our upper level pages that are still indexed but the site: issue is still a big problem. Our number of pages dropped more over night. :(

Pages are still vanishing for quality sites!

The best is the t1ps2see.com

The best is the t1ps2see.com page titles. "The Search Engine You Trust".

Anyone notice the query on t1ps2see.com when you visit eiqz2q.org is "animal"? A clue?

Wonder if he's still for hire... was awhile back.

BTW, don't you think they

BTW, don't you think they should be putting this stuff on a more brandable domain? They'd get much better CTR I bet. Hmmm, maybe something like about.com would be catchy.

Just the tip of the iceberg

In addition to spam sites permeating the index in record numbers, perfectly legitimate sites are choking Big Daddy as well. Craigslist.org would be a good example. A search for wedding forum delivers page after page after page of results from the site. We can add the inability to handle subdomains properly to the laundry list of deficiencies plaguing Big Daddy.

Me thinks this problem gets worse before it gets better.

A brief followup

> PS: Why did you guys remove that "searching X pages" note from the footer ? (tricky question)

From what I gather (this was before my time here), after the numbers got to a certain point, they became pretty meaningless to our users.

> Fess up, no smoke and mirrors or PR spin please.

While my referencing of a bad data push was clearly not the most popular description on TW, it's what it was actually called by engineers here directly working on the problem. With that said, though, I recognize that my emphasis on the order-of-magnitudes-off results numbers made it seem like I was belittling the presence of the junk that was still listed. Mea culpa. Tersity (is that a word?) -- while understandable late on a Saturday night perhaps -- did not serve me or you well in this context.

> By now every spammer will copy the technique.

As you might imagine, we're making adjustments to prevent stuff like this from happening in the future.

Are they aware of the site:

Are they aware of the site: search issues showing up again since this data push? Any word on what is going on there? Or was this never fixed and I am just seeing it happen again now. Coincidence?

Blogger flaw

So Adam, Google will be fixing the Blogger flaw that's responsible for all of the faked 404 blogspot pages? An estimated 16,000 pages with links to these subdomains?

As you might imagine, we're

As you might imagine, we're making adjustments to prevent....

Automatic adjustments or "hand job" adjustments ? :)

PS: I can only appreciate you for coming back here and keeping up the high morale, after all the TW'ers comments.

Not a brand new issue

Quote:
PS: I can only appreciate you for coming back here and keeping up the high morale, after all the TW'ers comments.

I want to echo these sentiments as well. I, as well as the others, appreciate you taking the time not only to respond but the time to come back and address the additional issues as they have been raised.

This being said, the way subdomains are, and have been being handled, has not been as it should.

As I pointed out in an earlier post, a perfectly legitimate site such as craigslist.org has had its subdomains dominating page after page of results for some time now. Subdomains are not and have not been handled properly. Throw in the fact that the supplemental index has been unsearchable for an even longer amount of time and I'm certain you can understand not only the frustration but the irritation of webmasters everywhere.

A "bad data push", as the engineers call it, and as you so aptly pointed out, is not what folks wanted to hear.

I have to agree with Expertu that these "adjustments" appear to be nothing more than a "hand job" based upon what I've been seeing.

anybody know

How do I turn off ThatAdamGuy so I don't see his posts? I can't find it anywhere.

The Billion Page Genius Appears

The mad genius behind the sub-subdomains is posting in the Digital Point thread right now. He posted identity proof on t1ps2see.com; it is definitely him.

Ok, after a QUICK looksee...

Adam, the thing that gets me is that most of the Google team makes considerably more than me, yet can't seem to find stuff that takes only a couple of minutes to locate.

Checking to see if the issue is fixed, I do a logical check on an abused TLD and uncommonly coupled keywords, then scroll to page 2 of 100 results per page:
http://www.google.com/search?q=site:.info+loan+pizza&num=100&hl=en&lr=&safe=off&start=100&sa=N

So, how many of these AREN'T spam?

Then I grab a domain a couple of listings down, and do a site: on it:
http://www.google.com/search?hl=en&q=site%3Au9vcvo.info

Is 11,200 (at the time of this posting) indexed pages in 17 days normal? That is assuming, of course, that Gbot found the first site on day one of the domain being registered. Based on that number, had I not just alerted you to this site and you banned it (as I assume you will), how many pages would be in the index 3 days from now? Or a week?

It's obvious that you manually banned the sites in question. Considering that this is relatively easy to duplicate, how long do you think before you have 2000 or more people attacking Google at once using this method? Is manual banning going to be anywhere near to enough? Or will it be attacking a woolly mammoth with a flyswatter?

Btw, if you need, I could probably write a bot to spot abuse like this early in the game.
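The core of such an early-warning check could be as small as the sketch below. It is only an illustration under stated assumptions: `pages_per_day`, `projected_pages` and `is_suspicious` are hypothetical names, the 500 pages-per-day cutoff is an arbitrary placeholder rather than a known threshold, and the projection assumes linear growth, which a real spam network may well exceed.

```python
def pages_per_day(indexed_pages, age_days):
    """Average number of pages indexed per day since registration."""
    return indexed_pages / max(age_days, 1)


def projected_pages(indexed_pages, age_days, extra_days):
    """Linear extrapolation (an assumption) of the page count."""
    return indexed_pages + pages_per_day(indexed_pages, age_days) * extra_days


def is_suspicious(indexed_pages, age_days, max_daily_rate=500.0):
    """Flag domains growing faster than an arbitrary daily cutoff."""
    return pages_per_day(indexed_pages, age_days) > max_daily_rate


# Figures quoted above: 11,200 indexed pages, 17 days after registration.
rate = pages_per_day(11_200, 17)           # roughly 659 pages per day
in_three_days = projected_pages(11_200, 17, 3)
flag = is_suspicious(11_200, 17)
```

Registration dates would have to come from whois data and page counts from the index itself; neither lookup is shown here.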

Also, side note, if the bad data push is corrected, would the number returned by these queries be accurate?
http://www.google.com/search?hl=en&q=site%3A.com
http://www.google.com/search?num=100&hl=en&lr=&safe=off&q=site%3A.org
http://www.google.com/search?num=100&hl=en&lr=&safe=off&q=site%3A.net
http://www.google.com/search?num=100&hl=en&lr=&safe=off&q=site%3A.info
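The TLD-wide queries above differ only in the TLD, so they can be generated rather than typed out. A small sketch using Python's standard `urllib.parse.urlencode`; the function name `site_query_url` is hypothetical, and only the `hl` and `q` parameters from the first URL are reproduced.

```python
from urllib.parse import urlencode

SEARCH_BASE = "http://www.google.com/search"


def site_query_url(target, lang="en"):
    """Build a site: query URL for a domain or a bare TLD such as '.com'."""
    params = {"hl": lang, "q": "site:" + target}
    return SEARCH_BASE + "?" + urlencode(params)


urls = [site_query_url(tld) for tld in (".com", ".org", ".net", ".info")]
# urls[0] == "http://www.google.com/search?hl=en&q=site%3A.com"
```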

Do you know what the numbers were 4 weeks ago, and if they might indicate that there is still a high pollution index?

-Michael

I'd hope TW'ers would

I'd hope TW'ers would realize that just because people click on an AdSense ad, it does not necessarily mean that an advertiser is charged and a publisher is paid.

Is that even the relevant issue at hand there?

What about other publishers that may not want to be associated with auto-gen content? Or the general (lack of) perception of network quality?

Surprised nobody's mentioning

our other little spammerjammer a la itrafficweb origin that's been ever present. For a couple of times it was the same exact keyword list and very little modification that I could see and still working fine... just moved to a new (sub)domain.

Spamtacular.

Ps guys : Check out how many

Ps guys : Check out how many OTHER websites the spammer had in the index (I used the ip: operator of MSN).

If we could see the other websites too (we can't, because Google has banned the IP address, not just the domain itself, so all of them are banned), I think that combined, they had tens of billions of pages.

Here are some domains caught from this Google Groups discussion:

site:cgq7wm.org
site:eiqz2q.org
site:t1ps2see.com
site:etlz8o.org
site:viwhha.org
site:qge6f7.org
site:rfni70.org
site:jkthy0.org
site:geku8h.org

So ?

The Shadow knows

Quote:
I'd hope TW'ers would realize that just because people click on an AdSense ad, it does not necessarily mean that an advertiser is charged and a publisher is paid.

Who knows what evil lurks in the hearts of men? (mp3)

Lol hardball ... It's like

Lol hardball ... It's like that Scooby doo cartoon.

Too old for something this fast

Expertu, that discussion was from Friday, you won't find anything new there. Try to follow in the forums (WMW, DP, SEORefugee). More up to date info there.

-Michael

Well, some of us are not so

Well, some of us are not so up-to-date like you are ;)

Interesting. Blogger has

Interesting. Blogger has fixed the take over of 404 accounts hole.

But it was just a "bad data push" right adam?

You guys sure seem to be moving awful fast right now :)

All of the faked 404 pages

All of the faked 404 pages are still indexed.

>>Bug, yah,

>>Bug, yah, right!!!

http://www.alexa.com/data/details/traffic_details?q=t1ps2see.com&url=t1ps2see.com

err, I see an alexa rank of 331,000. That means a few hundred visitors a day, right?

It means his pageviews are

It means his pageviews are crashing, that he had millions if not billions of listings in Google to get in the top 2,000 sites on the internet, then crashed when he got banned from Google!!!

Nintendo incorrect: his

Nintendo, incorrect: his Alexa rank today is 3,615, and his traffic hasn't really decreased much yet.

Today    1 wk. Avg.    3 mos. Avg.
3,615    1,940         331,498

Watch the linkage he's got

Watch the linkage he's got now.

Er, look

Er, look at...

http://traffic.alexa.com/graph?w=379&h=216&r=1m&z=&y=r&u=t1ps2see.com

His traffic is crashing, along with the other sites!!!

We know he has other domains

We know he has other domains still listed that have not been banned.

We know that one of his techniques is to redirect all of the traffic to t1ps2see.com.

He obviously still has rankings somewhere, with some domains, and is pushing that traffic to his main site. All of the other domains that were reported have crashed to absolute zero. This one is lingering on with continuing traffic.

LOL .. nice edit ?

LOL .. nice edit ?

Freaking Funny

This has to be the funniest thread I've ever read on an SEO site. I feel bad for Adam getting hazed that way by his fellow employees, brings back old college memories.

qwerty?

Why is my nick on the edited post? I only know a few porn sites, and don't spend more than 10 hours on them per day (each of them).
