Everything you need to know about Google Sitemaps

Source: How to Make Use of Google SiteMaps

Sebastian takes us through a *definitive* guide and tutorial on Google's Sitemaps. It's one hell of a post, and well worth the bookmark.

I've got Sitemaps running on Threadwatch, and didn't find it that difficult to set up; even the regex on the filters wasn't too bad. But Sebastian's guide goes much deeper than my knowledge.

He's covering everything, including advanced topics such as:

  • How to roll your own Sitemap (a minimal sketch follows this list)
  • Sitemap stats - seriously, I didn't know anything about this!
  • Discussions, resources and professional help
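
For a taste of the "roll your own" part, here's the kind of minimal file that bullet is about - my own bare-bones sketch with placeholder URLs and dates, not an excerpt from Sebastian's guide:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-06-08</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/archives.html</loc>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

Save it as sitemap.xml in your web root and submit its URL through your Sitemaps account; everything beyond <loc> is optional.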

It's a great read, and one I've tucked away for a quiet evening of tinkering around :) I really want to try out that stats procedure; it's a bit involved, but it looks like fun!

Comments

hehe! looks like you'll have to make an update already Sebastian :)

Google Mobile Sitemaps

Mobile sitemaps

Thanks for the kind words:)

I've mentioned the mobile sitemaps in the stats section. Because I have no WAP pages yet, I wasn't able to discuss them in detail. Perhaps I'll convert a few of my feeds and give it a try. However, the principle is simple: a mobile sitemap is a standard Google-compliant sitemap populated with URIs of WAP pages in one particular markup language (XHTML, WML...), submitted via a separate form.
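
Going by that description, a mobile sitemap for, say, a couple of WML pages would look like any other sitemap; only the listed URIs differ. The paths below are made up:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://www.example.com/wap/index.wml</loc></url>
  <url><loc>http://www.example.com/wap/news.wml</loc></url>
</urlset>
```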

you knew it was coming...

For security reasons, Google will not verify a location if the web server's response to invalid page requests is not a 404. The error message says 'We've detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.', but it occurs even on redirects, e.g. 302.
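
In other words, the check boils down to requesting a page that shouldn't exist and looking at the status code the server answers with. A sketch of such a probe - my guess at the mechanics; the random path is an assumption, not the URL Google actually requests:

```python
import http.client
import uuid

def check_404_behavior(host):
    """Request a page that almost certainly doesn't exist and
    return the status code the server answers with."""
    bogus_path = "/" + uuid.uuid4().hex + ".html"  # random, should not exist
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("GET", bogus_path)
    status = conn.getresponse().status
    conn.close()
    return status

status = check_404_behavior("www.example.com")
if status != 404:
    print("Missing pages return %d, not 404 - verification will fail." % status)
```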

Should I understand that Google hits the site with a bad request every time it validates, just to check this? And what does that bad request look like in the logs?

ahh... ok

So what kind of information can you get out of those stats?

Verification + Info

The verification process is a one-time procedure meant to ensure that a webmaster (account) is permitted to view the stats. After the verification, the verification file should not get requested any more. You can change your .htaccess for the few minutes the verification process takes.
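
For the Apache crowd, that .htaccess tweak often amounts to the following: an ErrorDocument referencing a full URL makes Apache answer with a redirect instead of a 404, which is exactly what the check complains about. A sketch with placeholder file names:

```apache
# ErrorDocument pointing to a full URL makes Apache redirect (302)
# instead of serving the error page with a 404 status:
# ErrorDocument 404 http://www.example.com/notfound.html

# A local path keeps the 404 status in the header:
ErrorDocument 404 /notfound.html
```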

Information provided on the stats page includes only errors per URI, for example:

* Googlebot following an invalid inbound link ran into the 404 error page

* Request of a dynamic page produced a 500 error

* A page couldn't get fetched because of connectivity problems

* Page in a password-protected directory not available

* Tried to fetch /somepath/xyz.html but your unfriendly robots.txt kicks me out
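
That last error would be triggered by a robots.txt rule along these lines (reusing the made-up path from the example):

```
User-agent: *
Disallow: /somepath/
```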

The whole thing was released today; I expect that Google is still working on the error descriptions. For example, an error caused by password protection gives the same error text as a 404, the page carrying an invalid inbound link is not listed, timestamps are missing, and so on. However, that's way more info than server logs or even highly sophisticated bot trackers can deliver.

as a public service

I imagine a web service which proxies Google sitemap account management. Of course the Google TOS can change at any time :-)

Google TOS can change at any time ..

And very likely will.

The difference now is that sites using Sitemaps have a contract with G, and can no longer claim "I don't give a toss for Google's TOS".

We live in interesting times; I note that sites are already 'disappearing' after submitting a sitemap; at least it's a quick death, I suppose ...

Sitemaps have given Google more power than any other single development. I wonder what they'll do next?

Disappearing sites

I've checked a few cases of disappeared sites. Everywhere I've looked, I found a good reason (in terms of Google's policies) to deindex the site, for example massive amounts of doorway pages, or obscure or duplicated links pages ... so much for the quick deaths.

There seems to be another kind of temporary death, assumed to appear in conjunction with sitemaps: some sites get moved completely into the supplemental index, or all listings change to URL-only, shortly after a sitemap submission. I've not yet identified the pattern, but it seems that in some cases Googlebot-Mozilla inspects a site in preparation for a deep crawl, and if she finds out that Google's index is in no way up to date, the indexed pages get wiped out.

That's not that uncommon with partly indexed, huge dynamic sites which phase out a portion of their content regularly or rotate the content. It seems to happen only when the gap between the sitemap submission and the currently indexed stuff is significant. Usually the site's standing after the resurrection (deep crawl and (re)indexing) is much better, but this process can last a few weeks.

I'm very interested in more information on this index-coma phenomenon. If my assumptions apply, those sites should probably start with tiny sitemaps (small bunches of pages submitted step by step) to avoid a massive traffic loss.
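
A sketch of what submitting small bunches of pages step by step could look like, assuming you have a plain list of URLs; the chunk size and file names are my assumptions, not anything Google prescribes:

```python
from xml.sax.saxutils import escape

# Split a site's URL list into small numbered sitemap files that can be
# submitted one at a time. Chunk size and file names are assumptions.
CHUNK = 100  # pages per sitemap; start small

def write_tiny_sitemaps(urls, chunk=CHUNK):
    header = ('<?xml version="1.0" encoding="UTF-8"?>\n'
              '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">\n')
    for i in range(0, len(urls), chunk):
        name = "sitemap%d.xml" % (i // chunk + 1)
        with open(name, "w", encoding="utf-8") as f:
            f.write(header)
            for u in urls[i:i + chunk]:
                f.write("  <url><loc>%s</loc></url>\n" % escape(u))
            f.write("</urlset>\n")
```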

>url only

Exact opposite for me - didn't notice any particular increase in traffic, though I *think* there was a bit, as AdSense earnings went up shortly after I did sitemaps. (I tend to judge TW on the conversations we're having, not traffic.)

Some weeks after doing a sitemap, TW's SERPs look much prettier :)

Doesn't apply

TW doesn't match the pattern. A huge shop selling short-lived products would. If the nav scheme is nearly 100% hierarchical and SKUs are phased out frequently, chances are good that the current site's content and the indexed copies of its pages don't have much in common.

Sitemaps can help AdSense better target the ads, and they definitely help with indexing. Sometimes the AdSense bot recrawls pages shortly after sitemap updates, and perhaps you'll find Mediapartners-Google/2.1 requesting the XML file in your logs.
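
If you want to check for that, a quick scan of the access log does it; the log path and format below are assumptions:

```python
# Look for the AdSense bot (Mediapartners-Google) fetching the sitemap.
with open("/var/log/apache/access.log") as log:
    for line in log:
        if "Mediapartners-Google" in line and "sitemap" in line.lower():
            print(line.rstrip())
```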
