TheGoogleCache

17 comments

In response to the recent Google Cache Fair Use ruling, this site is actually caching Google, page by page. http://www.thegooglecache.com

Comments

Good Catch !!

Good Catch !!

How did you find that site so fast after it got put up ??

;-)

oops

someone has spotted the site's scripting flaw - site now redirecting to php.net

the site's scripting

the site's scripting flaw

that should be plural "flaws"

really ..

it's now going to Matt's Blog ;)

DaveN

added the non-www of course

At least it's not going to

At least it's not going to lemon party.... yet

index page text

That redirect is annoying. Here is the text from the index page (a quick sketch of the cache-scraping described in its point 1 follows the quote):

---------------------------------------------------
What is TheGoogleCache
This is a protest site, pure and simple. I am not in any way affiliated with Google, and they certainly do not condone this website. If Google sends a mean enough letter, I am sure that I will cower and take this site down in a second. After the recent ruling that stated Google's cache of copyrighted materials was "fair use", I decided to put this to the test myself. This is The Google Cache. You search Google, your results get cached. It is that simple. Is it legal?

Why Am I Doing This?
The Google cache is absolutely ridiculous. As an individual who has had quite a bit of experience on both sides of the white hat / black hat search engine industry, the cache is NOT a webmaster's friend.

1. The cache takes content control away from the author. For example, a site like EzineArticles.com prevents scraping by using an IP-blocking method based on the speed at which pages are spidered by that IP. It is absurdly easy to circumvent this by simply spidering the Google cache of that article instead of spidering the site. Google's IP blocking is far less restrictive, and combined with the powerful search tool, it allows for easy, anonymous contextual scraping of sites whose Terms of Service explicitly refuse it.

2. The cache extends access to removed content, often for months if not years at a time. Google rarely replaces 404 pages (perhaps because of their wish to have the largest number of indexed pages). I have clients who have nearly 48,000 non-existent pages still cached in Google that have not been present in over 14 months. Despite using 404s, 301s, etc., these pages have not yet been removed. Furthermore, Google's frequent mishandling of robots.txt, nocache, and nofollow leaves webmasters dependent upon search traffic hesitant to force removal of these pages using the supposedly standardized methods of removal.

3. The cache allows Google to serve site content anonymously. Don't want the owner of a site to know you are looking at their goods (think of companies grepping for competitor IPs)? Just watch the cache instead.

The list goes on and on. But I think the point is this...

Why should a web author have to be technologically savvy to keep his or her content from being reproduced by a multi-billion dollar US company? Content control used to be as simple as "you write it, it's yours". It got a little more complicated with time, to the point at which it might be useful to use, perhaps, a Terms of Service. Even a novice could write "No duplication allowed without expressed consent". Now, a web author must know how to manipulate HTML meta tags and/or a robots.txt file.

Legality
Actually, what I am doing has been legal for a long time. This is where a user-directed query is cached to be easily accessed in the future. This is different from a bot, which has different rules altogether. Anyway, here is the jargon. 1. This is not a robot; it is more like a proxy. 2. This is consistent with DMCA 512(b) on caching: (A) the material is made available online by a person other than the service provider; (C) the storage is carried out through an automatic technical process for the purpose of making the material available to users of the system or network who, after the material is transmitted as described in subparagraph (B), request access to the material from the person described in subparagraph (A). And conditions (2)(A)-(C) are met, more specifically section (B), which excludes service providers when the service is brought forth by the person described in Section (1)(A). This site is not to be associated with any business or company for which its creator, Russ Jones, works, except perhaps for Google AdSense, because it would be funny if Google stuck it to themselves.

Fun with Legal Speak
cited: EFF
1. Serving a webpage from the Google Cache does not constitute direct infringement, because it results from automated, non-volitional activity by Google servers (Field did not allege infringement on the basis of the making of the initial copy by the Googlebot);
2. Field's conduct (failure to set a "no archive" metatag; posting an "allow all" robots.txt) indicated that he impliedly licensed search engines to archive his web page;
3. The Google Cache is a fair use; and
4. The Google Cache qualifies for the DMCA's 512(b) caching "safe harbor" for online service providers.

Just replace "Google" with "Scraper Site" and "Field's" with "your". It is fun, because basically what it does is allow textual copyright infringement across the web if you are offering another webmaster's content merely as a "cache" or an "archive".

What is Google's Stake
This is a really hard question to answer if you believe that Google is not just out to make a buck. First, this is a really easy situation to fix. All Google would have to do is assume nocache instead of cache. Your pages would still make it into the search engine. They would still be indexed and searchable. Google would still get their search results. BUT, Google could not reproduce/republish the entirety of your content without your expressed permission. This would just be like Google Print.
So what does Google get out of the Cache? The only intrinsic value that is passed to the user is that Google can pass old or no-longer-available information to the site visitor. This means that Google can make a profit (think AdWords) by selling your old or no-longer-available content. That's it. It is just to make money.
---------------------------------------------------
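
To make the quoted point 1 concrete, here is a minimal sketch of what that sort of cache-scraping amounts to: instead of requesting an article from the publisher (and tripping its IP rate limits), you request the search engine's cached copy. The cache endpoint, the "cache:" query format, and the target URL are all assumptions for illustration, not a tested recipe.

import urllib.parse
import urllib.request

CACHE_ENDPOINT = "http://www.google.com/search"  # assumed cache endpoint

def fetch_from_cache(page_url):
    # Ask the search engine for its cached copy of page_url; the origin
    # server never sees this request at all.
    query = urllib.parse.urlencode({"q": "cache:" + page_url})
    request = urllib.request.Request(
        CACHE_ENDPOINT + "?" + query,
        headers={"User-Agent": "Mozilla/5.0"},  # bare scripts are often refused
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

print(fetch_from_cache("www.example.com/some-article.html")[:500])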

not really sure how caching

not really sure how caching google.com makes the point -- I see it adding value to Google, not hurting it.

And besides, Google is just other people's content remixed, so if you cache Google, are you stealing from Google or from the web publishers you're trying to defend?

Stoopid

Why should a web author have to be technologically savvy to keep his or her content from being reproduced by a multi-billion dollar US company?

If you're technologically savvy enough to get your content online, dropping one more line in the page to disable cache is trivial.

If everyone disables cache, this becomes a moot point, so get busy fixing your pages instead of railing against it like little kids throwing a tantrum, and we can get past this.
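
For what it's worth, the "one more line" here is presumably the robots meta tag with a noarchive value (an assumption on my part, since the thread only says "nocache"):

<meta name="robots" content="noarchive">
<!-- or, to address only Google's crawler: -->
<meta name="googlebot" content="noarchive">

Note that robots.txt solves a different problem: a Disallow rule keeps a page from being crawled at all, while noarchive leaves the page indexed and searchable but drops the cached copy, which is the default behaviour the protest page argues for.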

I agree, if you publish

I agree, if you publish something questionable you should have put a meta no-index on it in the first place; the whole cache thing is blind anti-Google hysteria. Besides, if you post anything scandalous, chances are it's going to be mirrored in a lot more places than Mountain View before long, ESPECIALLY if you try to cover it up.

The Google cache serves a legitimate useful purpose - if the site you want is down, use the cache. Bloody useful.

Very Useful

The Google cache serves a legitimate useful purpose

True - it lets everyone you have blocked from accessing your website download your content via Google.

Responses...

(1) It is just anti-Google hysteria: On the contrary, it is analyzing the basic principles of copyright ownership. We have long accepted that snippets (just like quotes in a book) are acceptable. But in no shape or form, in no other medium, has any person or organization been free to lawfully reproduce/republish an entire work without express, written permission. What people don't understand is that anyone, even competitors, can arguably cache the contents of your site as an archival service.

(2) I believe that one of the greatest driving factors of information on the Internet is content for profit. If web authors must become technologically savvy to prevent copyright infringement, they are going to be less inclined to continue using this method. I can already write a simple proxy server that "surfs the cache" as opposed to the author's site (see the sketch after this list).

(3) The Google cache is useful when a site goes down, but by that logic it would also be useful to have a printed copy available at the local bookstore. Unless the content author expresses that he/she wishes to have that reproduction available on Google, it should not be republished.

(4) The Google cache is harmful as well. Just search for "displaylinks()" and you can turn up thousands of results of hapless people using Link Vault (probably with the autoupdate feature) who are now vulnerable to all kinds of things. They are only vulnerable, however, because the cache allows you to find their LV code. If only the snippet were provided, you could find sites using LV, but you could not find their particular LV code and, thus, the writable files, etc. Just check out johhny.ihackstuff.com if you want a list of everything you could ever imagine.
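
Since point (2) mentions a cache-surfing proxy, here is a rough sketch of what such a thing could look like, purely as an illustration: a tiny local HTTP server that answers each request by pulling the Google-cached copy of the requested URL rather than touching the origin site. The cache query format, the port, and the minimal error handling are all assumptions.

import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE_ENDPOINT = "http://www.google.com/search"  # assumed cache endpoint

class CacheProxy(BaseHTTPRequestHandler):
    # Answer GET /www.example.com/page.html by fetching the cached copy
    # of www.example.com/page.html instead of the page itself.
    def do_GET(self):
        target = self.path.lstrip("/")
        query = urllib.parse.urlencode({"q": "cache:" + target})
        request = urllib.request.Request(
            CACHE_ENDPOINT + "?" + query,
            headers={"User-Agent": "Mozilla/5.0"},
        )
        try:
            with urllib.request.urlopen(request) as upstream:
                body = upstream.read()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(body)
        except Exception as exc:
            self.send_error(502, "cache fetch failed: %s" % exc)

# Point a browser at http://localhost:8080/<site>/<page>; the origin site
# never sees the visit, only the search engine does.
HTTPServer(("localhost", 8080), CacheProxy).serve_forever()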

Just Google?

Yahoo and MSN cache too, as do all the second-tier engines.

Google just gets all the hysteria.

The argument about not making an author become technologically savvy for that one line of code in a page is silly, as they also need to know how to insert meta descriptions and meta keywords to help get their pages indexed, so is learning about nocache such a stretch?

If all the idiot-proof tools, such as blogs, Mambo, FrontPage, Dreamweaver, etc., have nocache built into them, then it becomes much less of a problem.

More proactive changes instead of discussion, and the landscape could change within days.

FWIW, someone could easily write a simple CGI script that crawls the httpdocs folder and inserts the nocache commands into any existing web pages site-wide, thus eliminating the argument about updating old sites with thousands of pages.
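
In that spirit, here is a rough sketch of such a script, written as a standalone walk-the-docroot pass rather than a CGI; the httpdocs path, the noarchive meta tag, and the naive head-tag insertion are all assumptions, so back up the files before running anything like it for real.

import os

DOC_ROOT = "/var/www/httpdocs"  # assumed document root
TAG = '<meta name="robots" content="noarchive">'

def add_noarchive(path):
    # Insert the noarchive tag right after the opening <head> tag,
    # skipping files that already opt out. Deliberately naive.
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        html = fh.read()
    lowered = html.lower()
    if "noarchive" in lowered:
        return False
    start = lowered.find("<head")
    if start == -1:
        return False
    end = lowered.find(">", start)
    if end == -1:
        return False
    patched = html[:end + 1] + "\n" + TAG + html[end + 1:]
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(patched)
    return True

for dirpath, _dirs, files in os.walk(DOC_ROOT):
    for name in files:
        if name.lower().endswith((".html", ".htm")):
            full_path = os.path.join(dirpath, name)
            if add_noarchive(full_path):
                print("patched", full_path)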

no stretching

they also need to know how to insert meta descriptions and meta keywords to help get their pages indexed, so is learning about nocache such a stretch

No, they don't.

There are a ton of people who don't give a damn about meta headers of any kind. The resident experts at TW might know all about them and care that they are properly deployed for maximum effect. I would argue that they are the exception.

Arguing that caching should be opt-out until you are blue in the face is not going to make it right. In any case, publishing a cache has nothing to do with the ability to index the web or publish SERPs.

In the case of Yahoo!, the World Association of Newspapers notes in an article today that Yahoo! PAYS for the right to display articles from an undisclosed number of the association's members.

(internal TW link)

As with other publishing partnerships that Yahoo! engages in, money is changing hands and the partners are happy. The difference is that Google keeps trying to do the same things without paying and without permission.

In some circles, that is still called theft.

As long as Yahoo! keeps paying and making new deals, big publishers will always notice that there is another party, say Google, that is trying to get the same "stuff" without equivalent compensation. Naturally enough, they will then attempt to protect their revenue stream.

Not arguing opt-out

You missed my point entirely.

I'm saying OPT-OUT is the status quo on ALL the search engines, so arguing that it should be opt-in and railing against the system is currently a non-starter, as it's not changing any time soon that I can see.

You're better off fixing the tools, templates and CMS systems to opt out and be done with it, rather than going on and on that it should be opt-in.

My sites have all opted-out, with the exception of my stupid blog; how about yours?

Be proactive: more action, fewer words.

serial killers

So, to paraphrase, you are saying that if there is a serial killer on the loose, everyone should just forget about catching the killer and concentrate solely on personal security measures?

Somehow, that doesn't seem right.

>> You missed my point

>> You missed my point entirely.

Your point seemed to assume that anybody with a website has the technical savvy to create meta tags. In fact, it could be argued that you were stating that those meta tags were required for a site to be indexed in the first place.

No matter how many SEs cache ... as that cache is not required for the normal functioning of a search engine, they are taking liberties by making it opt-out.

>> My sites have all opted-out
Great, well done. Now if they change the rules and say that they are ignoring robots.txt and only recognising nocache requests via meta tags, you'll be happy to revise every page in your entire network?

Didn't say it was right

Said it's the status quo - so I opted out of the status quo.

Debating it is pointless at this time, as it's nothing they'll change for us unless you want to start a class-action lawsuit. Personally, my time and money were better spent altering my website to opt out.
