Bad Data Push part deux

26 comments

ok let's play a game ... :)

http://www.google.co.uk/search?hl=en&q=christine+dolce+nude what's at #1? (i have screenshots)

hmm I thought :

a) Google had a robots.txt which stopped this from happening
b) if they do abide by their robots.txt then having the keyword in the url is really really important
c) they are just screwing with me

DaveN

Comments

They *are* violating their

They *are* violating their own robots.txt with that though. See site:google.com/search phone

Maybe...

...AOTA? Robots.txt is often 'ignored' during updates and shit.

I think b) is not really exotic.

About c).... Are they screwing with you as much as you are screwing with them?

c) they would have picked

c) they would have picked something else off the original list and made you number one for either...
1.fat arse
2.tosser
3.hahahaha
:)

Well... At Least Its

not an IMAGE search :)

Oho, so now "bad data push"

Oho, so now "bad data push" can be used to refer to anything you dislike, like "snakes on a plane"? :)

If you put quotes around that search, I only see 61 results, and your earlier post about it is at #2, so it's not a very competitive phrase. In fact, the only links to the Google SERP are from you, Dave.

But I think you're asking about how a url which is forbidden by robots.txt can show up in our search results. I thought everybody knew the answer to that, because I addressed it in my many-splendored post at http://mattcutts.com/blog/googlebot-keep-out/

Allow me to quote liberally from Chapter 1, paragraph 10: ;)
"Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page. I’ve fielded questions about Nissan, Metallica, and the Library of Congress where someone believed that Google had crawled a page when in fact it hadn’t; a robots.txt forbade us from crawling, but Google was able to show enough information that someone assumed the page had been crawled. Happily, most major websites (including all the ones I’ve mentioned so far) let Google into more of their pages these days."

Which word am I looking for: wanker, or pants? :)

Which word am I looking

Which word am I looking for: wanker, or pants? :)

Probably "pants". :)

HAHA

Matt didn't you get the memo ... it's bait Matt Friday today :)

DaveN

But Matt, don't you think

But Matt, don't you think that indexing your own SERPS is a bit, well, rubbish? How can google.com/search?q=christine%20dolce%20nude&hl=en be a relevant result if that's the exact page I'm on at the time?

I wouldn't bait Matt too

I wouldn't bait Matt too much, who is to say that Google will not just have it with spammers once and for all and tell Matt to not even bother trying to help legitimate webmasters anymore?

indexing your own

Quote:
indexing your own SERPS

sounds like something a spammer would do, but guess they'd remove the self referencing result so not to look too stupid

Well, it's showing up

Well, it's showing up because of Dave's links from davidnaylor.co.uk; Dave, if you want me to take care of that or ask someone to investigate, just let me know. ;)

Ha ha, it's bait Dave Friday now..

proposed special logo

Whether there's a plausible reason for it showing up or not, bottom line is it's not useful to have the page that I'm visiting show up in the SERP.

I created a proposed logo for situations like this. I'd insert it here, but I'm not allowed to have images :)

Oh Oh

Ha ha, it's bait Dave Friday now.

You mess with the bull, you get the horns!

Ah, Dave took out his

Ah, Dave took out his frustration at being baited by making someone run around naked in Ripon:
http://www.davidnaylor.co.uk/archives/2006/08/18/not-daven-nude/

Or maybe that really *is* DaveN? Streaker in Ripon, he's not the pants..

aww

I was actually hoping to see christine dolce nude ... of course I have never heard of her.

But whatever happened to the

But whatever happened to the idea of not indexing a URL blocked by robots.txt unless there are external citations pointing at that page?

http://mattcutts.com/blog/googlebot-keep-out/#comment-18207

external citation

Quote:
But whatever happened to the idea of not indexing a URL blocked by robots.txt unless there are external citations pointing at that page?

but there is an external citation on:
http://www.davidnaylor.co.uk/archives/2006/07/11/google-sitemaps-again/

Doh!

Doh!

What happens is..

If the target page has a meta robots noindex tag on it, then that target page does not appear in the SERPs at all.

However, if all that excludes the bot is a disallow directive in the robots.txt file then that page will appear as a URL-only entry in the SERPS. The content at that URL will not be indexed, but the URL will still show up.

Yahoo goes one better: for a URL disallowed in robots.txt the SERPs usually show just a URL-only entry, but sometimes they build a title for the Yahoo search result using the anchor text from some other trusted site's link that points at the disallowed URL, but only where that anchor text is not "click here" or some other low quality generic text.

I thought this was SEO 101, maybe y'all like to catch up on some of the basics sometime.

If the target page has a

Quote:
If the target page has a meta robots noindex tag on it, then that target page does not appear in the SERPs at all.

That is not something you can count on with Google. I have a site with hundreds of pages in the index despite the fact that they have all a meta robots noindex tag.

Check it out ...

If the page is also covered by a robots.txt disallow directive, then Google never gets to see the meta robots noindex information, so the entry will appear as URL-only anyway.
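The behaviour described in the comments above can be summed up as a small decision function. This is only a sketch of what commenters in this thread report, not Google's actual logic; the name `serp_appearance` and its return labels are made up for illustration.

```python
def serp_appearance(disallowed_by_robots_txt: bool, has_meta_noindex: bool) -> str:
    """Model how a URL is reported to show up in the SERPs, per this thread."""
    if disallowed_by_robots_txt:
        # The crawler never fetches the page, so a meta noindex tag on it
        # goes unseen. The URL itself can still be listed (for example when
        # other sites link to it), but without indexed content.
        return "url-only"
    if has_meta_noindex:
        # The page is fetched, the noindex tag is seen, the page is dropped.
        return "not listed"
    return "full listing"


# A page that is both disallowed and tagged noindex still shows as URL-only.
print(serp_appearance(disallowed_by_robots_txt=True, has_meta_noindex=True))
```

The point of the ordering is that the robots.txt check comes first: a disallow hides the noindex tag from the bot.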

If the pages are excluded by robots.txt or by robots meta tag, and the page content itself is being indexed or cached then you have maybe found a bug with Google.

However, if you are using a robots.txt file, carefully check that there is a blank line before each new user-agent section (otherwise the later sections are not "seen" by their respective bots), and that there is at least one blank line at the very end of the file. Also make sure that all Google-related information is in the "User-agent: Googlebot" section if one exists, else it should be found in the "User-agent: *" section. That is, if both sections exist, make sure that everything that you want Googlebot to not index is included in the "User-agent: Googlebot" section, even if that means duplicating things that are already in the "User-agent: *" section.

Googlebot follows the "Googlebot" section if it exists, and ignores the "*" section if both exist in the same file.
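The section-precedence rule is easy to try out with Python's standard-library `urllib.robotparser`. This is a rough illustration, assuming that parser's behaviour is close enough to Googlebot's for this purpose; it is not Googlebot itself, and the robots.txt contents, the `example.com` URLs and the `SomeOtherBot` name are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /private/ is only listed in the "*" section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search

User-agent: Googlebot
Disallow: /search

"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own section, so the "*" rules are not applied to it.
googlebot_private = parser.can_fetch("Googlebot", "http://example.com/private/page.html")
googlebot_search = parser.can_fetch("Googlebot", "http://example.com/search?q=test")

# A bot with no section of its own falls back to the "*" section.
other_private = parser.can_fetch("SomeOtherBot", "http://example.com/private/page.html")

print(googlebot_private)  # True: /private/ was not repeated in the Googlebot section
print(googlebot_search)   # False
print(other_private)      # False
```

To keep Googlebot out of /private/ as well, the `Disallow: /private/` line would have to be duplicated in the Googlebot section.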

I've found that URL only

I've found that URL only SERPs are sometimes (always?) due to links from external sites to a page blocked by robots.txt.

Correct.

We're all on the same page with that.

No robots.txt

The site in question has no robots.txt file at present (it's on my to-do list).

other errors

Nice commentary g1smd. I held back my own and appreciate your choice of language.

Take a good close look at your meta declarations. I have found that the slightest deviation from the Google Standard is taken as license to ignore the statement. You can try for yourself (many of you do already... I've looked at TW member sites and seen all variations of NOINDEX,NOARCHIVE,NOCACHE, ALL etc....)

Test for yourself or study the serps. Until you can demonstrate factual evidence, you're not done studying/investigating IMHO.

"I have a site with hundreds

"I have a site with hundreds of pages in the index despite the fact that they have all a meta robots noindex tag."

buckworks, are they url-only? If we never crawled the page, then o' course we wouldn't have seen the noindex meta tag on the page. Sometimes we can make snippets from url-only pages if your site is in the Open Directory, too. If it's not either of those, I'd love to see the example pages to ask someone to check it out. Earlier in the summer I think that the supplemental results weren't handling noindex correctly, but I believe it's all good now.

seobook, what mm1220 said. Three snaps up, mm1220. ;)
