Craigslist Kills Search Spiders

10 comments

Craigslist has blocked most of their content from search spiders with their new robots.txt file.

Were they suffering from the same problem as WMW, or why else would they block the bots?

Comments

LAME and NAIVE

I run a huge database-driven site that gets scraped all the time. Well-behaved bots honor robots.txt, but scrapers don't.

You have to go to some extremes to stop scrapers.

Even at those extremes it seems some keep coming via proxy servers and then some.

Just don't get shot in the ass trying to stop them.

It's a battle royale and I'm getting sick of this shit.
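For what it's worth, the "extremes" usually start with something like per-IP rate limiting at the application layer. Below is a minimal sketch in Python with made-up thresholds; as noted, scrapers rotating through proxy servers will walk right past it, so treat it as a first line of defence only.

import time
from collections import defaultdict, deque

# Hypothetical thresholds - tune for your own traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

_hits = defaultdict(deque)   # ip -> timestamps of recent requests
_banned = set()

def is_scraper(ip):
    """Return True if this IP should be refused (crude rate limit)."""
    if ip in _banned:
        return True
    now = time.time()
    hits = _hits[ip]
    hits.append(now)
    # Drop timestamps that have fallen out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS_PER_WINDOW:
        _banned.add(ip)
        return True
    return False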

What Solution...?

You have to go to some extremes to stop scrapers.

Even at those extremes it seems some keep coming via proxy servers and then some.

Just don't get shot in the ass trying to stop them.

What is the solution then? This stuff is a constant worry, and when a portal/site has to go to extremes to block the whole lot off, then yes, you are shooting yourself in the foot. Is there not some other way? Maybe by blocking all robots except the big three?
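The "big three only" idea can be expressed in robots.txt itself, though only polite bots will respect it. Here is a quick sketch using Python's standard-library robot parser to sanity-check such a policy; the bot names and the test URL are just examples.

from urllib.robotparser import RobotFileParser

# A "big three only" policy: let Googlebot, Slurp (Yahoo) and msnbot in,
# shut everyone else out. Only well-behaved bots will honor this.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""

rfp = RobotFileParser()
rfp.parse(ROBOTS_TXT.splitlines())

for bot in ("Googlebot", "Slurp", "msnbot", "SomeRandomScraper"):
    print(bot, rfp.can_fetch(bot, "http://www.example.com/listings/"))
# Expected output: True, True, True, False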

Not that simple

Legitimate robots identify themselves and follow the rules, but rogue robots don't, nor do they honor any of the rules legitimate robots respect, such as robots.txt, NOFOLLOW, NOCACHE, etc. They are there for their own purposes and will take whatever they want unless you physically stop them.

I think craigslist is just trying to stop sites like Oodle at the moment, which are legit and probably honor robots.txt.

However, a couple of entries in their robots.txt file lead me to believe they're running a spider trap as well. If that's the case and they're trying to stop all the other bots like I've been doing, similar to the WMW situation, they're in for a wild ride.
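For anyone unfamiliar with the term: a spider trap is, roughly, a path you Disallow in robots.txt that no human and no honest crawler would ever request, so anything that fetches it has ignored the rules and gets banned. A bare-bones Python sketch; the trap path and the in-memory ban list are invented for illustration.

from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trap path - also listed as "Disallow: /trap/" in robots.txt,
# so any visitor requesting it has ignored the rules on purpose.
TRAP_PATH = "/trap/"
banned_ips = set()

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if ip in banned_ips:
            self.send_error(403, "Forbidden")
            return
        if self.path.startswith(TRAP_PATH):
            # robots.txt said "keep out"; whoever is here is a rogue bot.
            banned_ips.add(ip)
            self.send_error(403, "Forbidden")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Normal page\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), TrapHandler).serve_forever()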

damn...

Damn, I am sorry to see that... it's another case where a legitimate site becomes helpless against this kind of scraping. We are building something at the moment with a ton of unique content on it, some written by university professors and other pros in their own fields. It is constantly on the back of my mind that this stuff will be scraped. We have taken every legal step possible to protect ourselves, but at the end of the day we know the 'bots' will come, hammer the bandwidth, take the content, and remix/rewrite it.

Surely there is a way of stopping this in the build; if there is, we have not figured it out yet...

well....

there goes one of my tricks... :(

Try this

Alex Kemp's scraper stopper may be just what you need. To learn more about it, check out the WMW threads on "badly behaved bots".

It doesn't deal with all the situations I'm blocking but it's a pretty good starting point.

absolutely...

It doesn't deal with all the situations I'm blocking but it's a pretty good starting point.

Absolutely, I'll play around with this and see if it will do the trick, or at least whether it can be built upon.

Thanks

Have a close look at that robots.txt....

It looks like it's just housekeeping to me.

Looking at the robots.txt, I can't see that "Craigslist has blocked most of their content from search spiders".

If you run through the robots.txt restrictions, it's primarily blocking access to a few top level pages:

i.e.
/hhh/ is Housing
/ccc/ is Community
/jjj/ is jobs

etc.

I suspect that the site structure changed when they grew - and now everything is more generally ordered by location:

e.g. http://www.craigslist.org/ccc/ is the craigslist > san francisco bay area > community page

But the posts on that page go to directories like http://www.craigslist.org/sby/kid/1234567.html - which isn't blocked - where /sby/ is Southbay area; /sfc/ is San Francisco City etc.

And the 'location' subs like /sby/ and /sfc/ etc. aren't blocked.

Also - the 'blocked' /forums returns a 302 redirect to http://forums.craigslist.org/ which isn't blocked.
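That redirect is easy to verify with a few lines of Python that issue a HEAD request without following redirects; the host and path are the ones above, and the status and Location header are whatever the server actually returns.

import http.client

# Request /forums without following redirects so the 302 itself is visible.
conn = http.client.HTTPConnection("www.craigslist.org")
conn.request("HEAD", "/forums")
resp = conn.getresponse()
print(resp.status, resp.reason)       # e.g. 302 Found, per the observation above
print(resp.getheader("Location"))     # e.g. http://forums.craigslist.org/
conn.close()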

I don't think this has anything to do with 'bad bots', Aaron - I suspect it's just housekeeping....

Have a look at the internet archive http://web.archive.org/web/20050401051750/http://www.craigslist.org/robots.txt

The only change since April 1, 2005 is the addition of these subs:

Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj

These sections are now organised by location - and the location subs aren't blocked.

I'm seeing over 11 million pages indexed at Google ( http://www.google.com/search?hl=en&q=site%3Acraigslist.org ) and over 7 million at Yahoo ( http://search.yahoo.com/search?p=site%3Acraigslist.org&sm=Yahoo%21+Search&fr=FP-tab-web-t-296&toggle=1&cop=&ei=UTF-8 ).

The more I look, the more I'm sure it probably occurred when they started rolling out the location-based structure...

thanks, Chris

Good work and helpful clarification post.

Could be

But why would you block all of your old links via robots.txt, which would kill references in a hurry, instead of using redirects?

I went through that same kind of housekeeping situation myself, and the redirects seemed to clean up the search engines over a couple of months.
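To make the comparison concrete, the redirect approach answers requests for the old section paths with a 301 pointing at the new location-based URLs, and the engines fold the old references into the new pages over time. A toy Python/WSGI sketch; the path mapping is invented, not Craigslist's actual scheme.

from wsgiref.simple_server import make_server

# Invented mapping from old section paths to new location-based paths.
REDIRECTS = {
    "/ccc/": "/sfc/ccc/",
    "/hhh/": "/sfc/hhh/",
    "/jjj/": "/sfc/jjj/",
}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    for old, new in REDIRECTS.items():
        if path.startswith(old):
            # Permanent redirect: engines carry the old reference over to the new URL.
            location = new + path[len(old):]
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Current page\n"]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()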
