Nobody likes a scraper, not even a well-funded one...

Two or three weeks back, job listings site Oodle was blocked by uber community Craigslist. Both have been rah rah'd as Web 2.0 icons (well ok, Oodle much, much less so). The thing is, even if you have a bunch of VC cash and a cool name, a scraper is a scraper, and guess what? Nobody likes a scraper...

I saw the headlines when this happened, and didn't think it so newsworthy at the time, but Tom Foremski at SVG just posted some interesting data on exactly why Craigslist doesn't like being scraped for other people's profit:

Quote:
I had an interesting chat with Jim Buckmaster, ceo of craigslist, about this issue. Jim said that Oodle was the most aggressive in checking its listings and this was slowing things up for users.

Jim showed me a chart of craigslist traffic and how much traffic Oodle was bringing, and you could barely see Oodle's red line graph coming up off the x-axis, while the blue line of craigslist was flying high up in the logarithmic realms of the y-axis.

"We try and be fair and reasonable but aggregation sites like Oodle put a big strain on our infrastructure," he said. "We don't want our users suffering because of this."

It's funny, because 4 months ago you couldn't move in the blogosphere for people yelling at the top of their 20something lungs about the joys of "remixing" and "aggregation", and to an extent that thinking still continues, but lately there's been somewhat of a backlash.

Now we're suddenly seeing VCs, businesses like Craigslist, and pundits start to cry foul over what is, whichever way you dress it up, scraping.

Roll on the maturation of the current set of teenybopping, VC-loaded, "what business model" remixing morons, I say. Idealism is for the bathtub, or a quiet smoke in the garden. When there's money at stake it can take a hike...

Unless you're google...

Everyone hates a scraper, umm, except Googlebot and the like. And if you don't think Google is in the business of 'scraping' specific data off pages, then consider Froogle, which tries to identify the prices of items on pages and then displays those in Froogle (in addition to accepting data feeds from merchants). They also scrape the ratings of stores from shopping.com et al.


That's kind of the way Foremski puts it in his article:

Oodle and all the other google-like search-and-scrape sites...


The difference is craigslist doesn't block Googlebot

google search: site:craigslist.org
Results 1 - 10 of about 12,400,000 from craigslist.org
So that's what's sketchy about it ... they make this argument that scrapers are bad but let it go on depending on who's scraping.


This also happened to Google

According to The Search, similar things happened when Google was starting out. Nervous webmasters saw the Googlebot eating up bandwidth and accused G of all sorts of things. Back in the early days it wasn't bringing much traffic either... But look at it now. Since Oodle sends all their traffic away (like all search engines do) there is little harm in it.

Perhaps instead of banning scraping, set up a direct feed and charge Oodle for it. Why not have people be able to find your listings from the most places possible?

I like Craigslist to an extent, but it has way too much spam and is reallllly poorly designed usability wise. I would much rather search Oodle and get to CL through a nice search interface.


Good Business - there is enough in it for both parties (bot eats resources but engine gives traffic)
Bad business - one party gains, the other loses (scrapers eat resources but give nothing back)

Web 2.0 - needs a Good Business case, otherwise dead.in.water.


Spiders vs Scrapers

I'm not sure people are rational enough about the topic to even know what a real scraper is these days. Google and Oodle both add value to the content by organizing it and helping people FIND the original content, which is quite difficult without their help.

Take a quick once over of Oodle and I think we can all easily say it passes the sniff test as the content snippets are very minimal, it's well organized and appears to be quite a valuable resource for what it's intended to do.

I think Craigslist is just hyperventilating and should approach Oodle with some mutually beneficial solution, such as a list of daily changes to Craigslist, so that Oodle's spider doesn't have to do so much work.
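
To make the "list of daily changes" idea a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function name `daily_changes`, the listing IDs, and the idea that each snapshot is a dict mapping a listing ID to its text. Nothing here reflects how Craigslist or Oodle actually store or exchange data.

```python
def daily_changes(yesterday, today):
    """Compare two snapshots (dicts of listing ID -> listing text).

    Returns the IDs that were added, removed, or modified, so an
    aggregator would only need to fetch what actually changed.
    """
    added = sorted(today.keys() - yesterday.keys())
    removed = sorted(yesterday.keys() - today.keys())
    modified = sorted(
        listing_id
        for listing_id in today.keys() & yesterday.keys()
        if today[listing_id] != yesterday[listing_id]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical snapshots, made up for this example.
yesterday = {"a1": "Bike for sale", "a2": "Room to rent", "a3": "Free couch"}
today = {"a1": "Bike for sale", "a2": "Room to rent, now cheaper", "a4": "Piano lessons"}

changes = daily_changes(yesterday, today)
print(changes)
```

With a feed like this, the spider pulls a handful of changed listings per day instead of re-checking every page, which is the load problem Buckmaster was complaining about.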

Here's an example of scraping with NO value added:
http://www.clan-fta.us/articles/1920-fashion.html

If you can't distinguish the difference then it's a sad day.


I think the problem has less to do with the "search-and-scrape" sites and more to do with the netiquette of bots/crawlers.

A couple of months ago several people I know were hit by bandwidth overages, or were shut down, when Yahoo started aggressively crawling for images. How much of it was Yahoo's fault and how much of it was the webmaster's fault for leaving multiple high resolution photos open to crawling? I would wager a little bit of both.

Also, I don't think the number of links from search engines really paints the entire picture. It doesn't count the people who found articles on the search engine and then posted them to their blog, or sent them to a friend. Granted, this may end up being a marginal difference, but my point is that measuring a search engine's effectiveness (or return) through one vector can be inaccurate.

There was a deal (of sorts) struck in 1996 that the robots.txt protocol would serve as the set of guidelines a web site operator could impose on well-behaved spiders/crawlers/bots. This puts the onus on the webmaster to create one of these files and on the programmer of the crawler to obey it.

While it is not an official standard, all major search engines obey this rule. MSNbot and Inktomi Slurp (Yahoo) will even obey the Crawl-delay directive, so a webmaster can slow them down if they are using too much bandwidth.
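
For anyone who hasn't seen it in practice, here is a small sketch of how a well-behaved crawler might read those rules, using Python's standard `urllib.robotparser` module. The robots.txt content, the example.com URLs, the 10 second delay, and the helper name `crawl_policy` are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: paths and the delay value are assumptions.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, url):
    """Return (allowed, delay_seconds) for this bot and URL.

    delay_seconds is None when no Crawl-delay applies to the bot.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


slurp_listing = crawl_policy(ROBOTS_TXT, "Slurp", "http://example.com/listings/1")
slurp_private = crawl_policy(ROBOTS_TXT, "Slurp", "http://example.com/private/x")
other_listing = crawl_policy(ROBOTS_TXT, "OtherBot", "http://example.com/listings/1")
print(slurp_listing, slurp_private, other_listing)
```

A polite crawler would check `allowed` before every request and sleep for `delay_seconds` between requests when a delay is given. None of this is enforced by the server, which is why it only works on bots that choose to behave.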