Using Canonical Domains to Sabotage Competitors in Google

Story Text:

Most folks around here already know this, so this is for the uninformed and unprotected... Check all your domains if you have not already (I didn't, and I lost a site to this). Check to see if you are protected.

In your browser address bar, type in the URL of the website that you don't want to drop in the rankings (with and without the www) and take note of where you end up. Does the URL have a www or not? (Using Webbug and looking for redirects is a much better way.) If you do not use the www and you end up at the www page, or vice versa, then the site is protected and this won't hurt.

If, however, you enter the URL without the www and end up at the non-www URL, and you enter the URL with the www and end up at the www URL, your site is prime for a major dumping on.
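If you'd rather check the headers with a script than eyeball the address bar, here's a rough PHP sketch (example.com is a placeholder for your own domain; it sends a bare HEAD request, so no redirects are followed):

<?php
// Quick header check: a protected site answers 301 for one host and
// 200 for the other; two 200s means you are exposed.
function status_line($host, $path = '/') {
    $fp = fsockopen($host, 80, $errno, $errstr, 10);
    if (!$fp) {
        return "connect failed: $errstr";
    }
    fwrite($fp, "HEAD $path HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $line = trim(fgets($fp)); // just the status line, e.g. "HTTP/1.1 301 Moved Permanently"
    fclose($fp);
    return $line;
}

echo "example.com:     " . status_line('example.com') . "\n";
echo "www.example.com: " . status_line('www.example.com') . "\n";
?>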

If your site is not protected, someone could:
Find some places to drop some links for you. This evil person could drop some links using the non-www URL (or, if the site is set up for non-www linking, link to the www page).

If someone were to do this to you, you would see the results in Google in a very short time. The site that got hit can expect to be out of the running for some time; even if you discover the problem and correct it, it will take months to recover.

Using the incorrect URL as described above will trigger a duplicate penalty in Google for the affected site.

There are several ways to protect your site. The simplest way I know of is to use a 301 redirect.

Of course, if you are "evil" you could reverse this and go after your competitors...

Comments

If you have htaccess....

RewriteEngine On
RewriteCond %{HTTP_HOST} ^url\.co\.uk
RewriteRule (.*) http://www.url.co.uk/$1 [R=301,L]

When you check the headers, you get a 301 from the non-www and a 200 from the www version.

Cheers

I think I ...

I think I have it now:

-- if both URLs (with and without the "www") resolve to the site, and

-- if the non-www URLs don't redirect to the www URLs (or vice versa) ...

someone could point some links to the one you're *not* using and thereby get it picked up by Google, triggering duplicate content penalties. Is that it?

wildcard subdomains

I saw an incident once where wildcard subdomains were in place on a large site. Things could have gotten ugly, but they were advised to remove them. Imagine the possibility of creating hundreds to tens of thousands of duplicate sites. Ooooops

ukgimp, would you mind

ukgimp, would you mind emailing me that rule, for TW? I'm not sure it's come out right on the system and I'd like to put that in...

Specifically so that threadwatch.org goes to www.threadwatch.org - I have other subdomains that need to be OK...

Thanks!

Worked great

Thanks for this code. It worked great straight away.

I'd been looking around for this for a while and had seen several different variations at WMW as well as through searching Google.

What I had seen in some variations is the inclusion of [NC] (I think that's right) at the end of the second line to specify case-insensitive matching.

However, just from the code on this page I tested:

HTTP://DOMAIN.COM
HTTP://DOMAIN.com
http://domain.com/Page-Name.html

And every variation thereof. All worked great.

And just in time too, since a friend was actually pointing to one of my domains without the www, and Google had already put some of my pages into supplemental results.

Not too serious since that site isn't designed for SEO (it's a direct-response site), but a 30-second fix certainly never did any harm.

Backslashes

The backslashes in ukgimp's example were eaten by the system (they have been restored above), so be careful when copying rules from forum posts ;)

You can also do this with PHP (or ASP etc.) if you are generating your pages as you are here:

<?php
if ($_SERVER['HTTP_HOST'] != "www.example.com") {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: http://www.example.com" . $_SERVER['REQUEST_URI']);
    exit;
} else {
    // your usual HTTP headers go here
}
?>

Thanks lots0, and ukgimp and

Thanks lots0, and ukgimp and frank for the code!

Got it sorted at TW now, but have to look into the wildcard subdomain thing next....

For Frontpage users

FP users have faced a lot of problems with .htaccess that have been very, very difficult to resolve. Either the site or FP or both would break when trying to incorporate other 'normal' directives into FP's default file.

I've just run across this in the most recent WMW Bourbon thread. (You never know where you'll find that one great nugget deeply buried; just ignore all the noise around it.)

I have not yet tested this -- still gulping my morning caffeine -- but according to bumpski you have to include Options +FollowSymLinks after RewriteEngine On:

RewriteEngine On
Options +FollowSymLinks

Then, the most important part: Who the heck knew that some of the FP-specific directories had their own .htaccess files? So, in each one of these...

_vti_bin
_vti_bin/_vti_adm
_vti_bin/_vti_aut

...and for any subwebs, the Options None must be changed to Options +FollowSymLinks.
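So each of those FrontPage .htaccess files would end up reading something like this (untested here, per the caveat above):

# In _vti_bin/.htaccess, _vti_bin/_vti_adm/.htaccess and
# _vti_bin/_vti_aut/.htaccess (and any subwebs), replace:
#     Options None
# with:
Options +FollowSymLinks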

Not sure if I'm going to have time to give this a shot today; I might be able to get it up on a test site overnight or tomorrow.

Of course the easiest

Of course the easiest solution to the Frontpage issue would be to get rid of it and all of its extra stuff! :P

As we have some formatting

As we have some formatting issues hampering this thread, I've posted the Rewrite rules that are working for me in a text file for anyone it may help.

The 'protected-subdomain' bit is for any sub that needs to work properly - I'm fairly sure that those rules *should* be redirecting any wildcard subdomains to www.threadwatch but it's not doing that, so if anyone can improve on them, let me know :)

Lose the dot or the threadwatch.org bit

The line should either read:

RewriteCond %{HTTP_HOST} !^protected-subdomain\.threadwatch\.org

or:

RewriteCond %{HTTP_HOST} !^protected-subdomain\.

here's the full thing

Nick, I think I recall posting something like this for you before... a whole lot of weeks ago, even. Can't find the post now. Anyway, voice is right regarding your text file.

Here's an alternative full-featured version with all the bells and whistles, and explanations too (for those who want it to be 100% perfect):

#------------------------------------------
# First optional line, include only in case of errors:
RewriteEngine On
# Second optional line, include only in case of errors:
Options +FollowSymLinks
# Optional start tag, requires use of corresponding end tag as well
<IfModule mod_rewrite.c>
# ----------------- the real stuff starts here
# IF there's a host field at all, AND
RewriteCond %{HTTP_HOST} .
# IF domain does not start with www, AND
RewriteCond %{HTTP_HOST} !^www\.threadwatch
# IF subdomain is not another one of those you like
RewriteCond %{HTTP_HOST} !^sub1\.threadwatch [NC]
RewriteCond %{HTTP_HOST} !^sub2\.threadwatch [NC]
RewriteCond %{HTTP_HOST} !^sub3\.threadwatch [NC]
RewriteCond %{HTTP_HOST} !^sub4\.threadwatch [NC]
RewriteCond %{HTTP_HOST} !^sub5\.threadwatch [NC]
# THEN redirect everything to an appropriate location
RewriteRule (.*) http://www.threadwatch.org/$1 [R=301,L]
# ----------------- the real stuff ends here
# Optional end tag, only if you have used the optional start tag
</IfModule>
#------------------------------------------

All lines with "optional" in the comments are optional. (The forum software mangled the angle-bracket tags and ate the ":" before the "//" in the RewriteRule; both have been restored in the listing above.)

Some bots don't send the Host field, as that's an HTTP/1.1 thing. You can keep the comments in the ".htaccess" file.

-----------------------
Added: [NC] flags. (Optional)

As noted above, these mean "No Case", and hence the condition is case-insensitive. I.e., the subdomain can be spelled in uppercase or lowercase letters and it will still be caught by the condition.

The RewriteRule in this case makes sure that URLs matching the conditions will get translated to an all-lowercase www subdomain.

We still want wWw.ThReadWAtch.oRg to be redirected to all-lowercase, so the NC flag is not on the "www" RewriteCond(ition).

NC flags are fully optional.

More than meets the eye...

The 301 re-direct is truly the only way to ensure this does not happen even if your DNS entries seem to be in order.

Other tangible benefit

This is good practice for another reason...

Some folks unintentionally link to the non-www version of your site. Might as well harvest that link pop as well.

This has been on the "quick and easy tips" list for a while.

Sorry

I just re-read my first post in this thread and it makes very little sense.
Diane said it so much better:

-- if both URLs (with and without the "www") resolve to the site, and

-- if the non-www URLs don't redirect to the www URLs (or vice versa) ...

someone could point some links to the one you're *not* using and thereby get it picked up by Google, triggering duplicate content penalties.

The 301 re-direct is truly the only way to ensure this does not happen even if your DNS entries seem to be in order.

I don't know for sure, but I really hope this is not the case.
The domains we tested that did have correct DNS settings did not seem to be affected, but we only tested a few like that.

I had correct DNS entries for both www and non-www...

...and they resolved to the same page (but without the 301 re-direct, the URLs remained at whatever was entered in the address bar). Someone linked to me without using the "www" (I prefer "www" in all my work), and Google indexed some duplicate pages. They were removed a bit later, and it looks like some of the originals were removed too (my page count in Google went down to the level from before the non-www's came into the picture). Google doesn't index my site much (yet!!! ugh!) so it is easy to see what is going on.

I say 301 just to be sure.

Thanks for the recap claus,

Thanks for the recap claus, I couldn't find what you originally showed me either, but I'll go through that code and try to improve what I've done so far :)

Windows and ISAPI_Rewrite

And for those of you on a Windows Server using ISAPI_Rewrite, it's as simple as adding this to your .ini file...

RewriteCond Host: ^example\.com
RewriteRule (.*) http\://www\.example\.com$1 [I,RP]

Our Check Server Headers Tool (with recursive results) will help you determine whether or not you've got it all set up properly.

Linking code can be seen as Canonical

We are having problems with this: a link from a PR 8 site's tracking code is usurping our domain as the base.

on links

>> unintentionally link to non-www version

When I link out, I usually link to whatever I have in my browser address bar. If I have typed in the address it will always be without "www."

I just don't bother typing these four characters anymore, and I reckon sites that don't respond to the pure domain name are broken.

If no 301 is in place I will link to the version I like best -- quite intentionally -- and this version happens to be the shorter one. So, a 301 is also a hint to people linking in that one version is preferred over another.

Without a 301 in place, a linking webmaster would never know that you might prefer one version over another.

Thanks

Thanks, Lots0; I was just trying to clarify. And good points, claus. I personally dislike promoting the "www" version of a domain.

As well, my understanding is that www.domain.com is technically a subdomain of domain.com -- a "fix" from the days when computer functionality was such that www.domain.com, mail.domain.com, etc. had to be on separate computers. Today, even when the website and email are at different hosts, one can group www and non-www together via 301 redirects or mod_rewrite. Some hosts aren't too together in this regard, and *only* have the www pointing to the site, with the non-www 404ing out; how old school is that?

Of course, this type of discussion regarding promoting the non-www version of a domain led one of my friends to ask: "How will they know it's a website?"

Me: "You mean the dot com doesn't give it away?"

Up until google had its

Up until Google had its complete mental collapse (Bourbon), linking via the www or not was never an issue.

Using the www or not - I think it depends on the circumstances.

I do believe that most non-tech folk think you have to have it.

What?

It's been an issue for years now.

It's been an issue for years now.

Yes it has, and one that has caused problems for many. At one point there was split PageRank between the sub-domain and root domain. Google does a fairly decent job of merging the two together at some point. The most IBLs wins: if you have more IBLs to the sub-domain, that's what wins, and vice versa.

blog issues

These recent changes to the canonical issue are perhaps most serious for hosted blogs. If you look at typepad.com, for example, that service assigns the canonical root to the "primary blog" but also requires all blogs to be in folders off root. Consequently, all primary typepad blogs have two URLs (blogname.typepad.com and blogname.typepad.com/blogfolder/). The Typepad engineers have configured the server to return 200 OK response codes for both locations.

Since all single blogs are "primary" blogs, and all multiple-blog accounts are required to have one primary, every primary blog hosted on typepad.com is at risk.
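If the host (or anyone running blogs off folders like this) wanted to close the hole, it would be the same 301 trick as above. A hypothetical sketch, with blogname and blogfolder as placeholders:

# in the blog root's .htaccess: 301 the bare blog root to the real
# blog folder, so only one of the two URLs answers with a 200
RewriteEngine On
RewriteCond %{HTTP_HOST} ^blogname\.typepad\.com [NC]
RewriteRule ^$ http://blogname.typepad.com/blogfolder/ [R=301,L]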

And on that note (blog issues)

You can probably start imagining other ways in which you could screw a competitor royally besides the www/no-www way. These other ways are generally harder to prevent if you're using lots of mod_rewrite.

It's been an issue for years now.

agreed

Certainly before Bourbon or whatever it is called.

I checked PR out of curiosity some time back, only to find split PR for the www and non-www versions. Hence my search, and my first post above.

nuevojefe

Yep, I suppose it would be good to post them, but there is always that fear that people who had not got it figured out might start to do it.

I have a situation where I am using mod_rewrite and taking part of the URL to make the query.

e.g.

/somevar1/somevar2.php

somevar1 and somevar2 are taken out using a regex and used to create an SQL query.
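Something like this, say (a made-up rule purely to illustrate; lookup.php, var1 and var2 are invented names):

# feed the two path segments to one script as query parameters
RewriteEngine On
RewriteRule ^([^/]+)/([^/]+)\.php$ /lookup.php?var1=$1&var2=$2 [L]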

So in your query/SQL logic you need to check for zero results, and serve the correct page:


$qry = mysql_query("SELECT * FROM table where .....");
$count = mysql_num_rows($qry);
if ($count == 0)
{
//do nowt and serve that page
}
else
{
//serve normal page
}

This would prevent people from linking to hundreds of variants of the URL that might produce the same page and, as a result, dup content.

There is more to it than that but you can see the risks.

Cheers

sad isn't it?

let's just all watch the level playing field slip a little further away....

sad isn't it?

I suppose it is, but without sounding out of order, that is life.

My parents talk of a time when doors could be left open without worries. Now they are concerned about being twatted over the head with a bat for the change in their pocket.

You need to be aware of such things and limit the risk to yourself.

Cheers

Parameters

ukgimp - a good idea when translating URLs into parameters is to serve a proper 404 when a page is requested that doesn't exist. That way you don't get lots of URLs pointing to the same (code 200) page, even if that page only says "No such page here".

In PHP you'd write:

header("HTTP/1.0 404 Not Found");

404 vs 410

You may even want to consider using a 410 instead of a 404.
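In PHP that would be claus's one-liner with the status swapped:

header("HTTP/1.0 410 Gone");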

You may even want to consider using a 410 instead of a 404.

410 looks good

Quote:
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

Do we know if GOOG et al.

Do we know if GOOG et al. understand a 410?

thinking out loud

With the example above, if it is your DB you should know if there are errors in your data, so instead of a 404/410 you could serve a 301 and benefit from the link?

Seems too easy???

Nick, they still don't get 302's right, so good point!

More Info on Google and 410 Gone

410 Gone

Read jdMorgan's comment on the use of 410 Gone in the above topic (message #6).

Nick, they still don't get 302's right, so good point!

Serve 'em a 307. ;)

>>>What? It's been an issue

>>>What? It's been an issue for years now.

I guess I should have said, it was never an issue for me. ;-)

Do we know if GOOG et al. understand a 410?

I use the 410 quite often, and as far as I can tell Google handles (or at least has in the past handled) 410s correctly.

There is more to it than

Quote:
There is more to it than that but you can see the risks.

Exactly. It's tough because the average programmer and SEO are not ready for this type of assault. Heck, we're aware and do nothing on lots of sites for lack of time.

Quote:
so instead of 404/410 you could serve a 301 and benefit from the link?

Could get pretty complicated and a lot of 301s could cause who knows what with each crazy update's side-effects. But damn, it is always nice to get the benefit of another link.

Google redirects using 302s

Something that intrigues me is that despite Google recommending the use of 301 redirects, Google themselves use 302 redirects for http://google.com to www.google.com.

#1 Server Response: http://google.com
HTTP Status Code: HTTP/1.0 302 Found
Location: http://www.google.com/
Set-Cookie: PREF=ID=9792121908a09e36:TM=1118792255:LM=1118792255:S=yNQio99PlPAzC4Qf; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
Content-Type: text/html
Server: GWS/2.1
Content-Length: 152
Date: Tue, 14 Jun 2005 23:37:35 GMT
Connection: Keep-Alive
Redirect Target: http://www.google.com/

Great catch viz

Now that really makes a few alarm bells go off in my head.

GoogleGuy recommends webmasters use a 301, but they use a 302... WHY?

My guess would be...

...that GoogleGuy had no idea that 302 was in place. I don't want to defend Google's communications strategy...because I think it makes Stalin seem approachable, in retrospect...but I bet there are at least a handful of times a week when GoogleGuy throws his hands in the air and mutters the same, "WTF is going on here?", like the rest of us.

Dude is probably one TPS report short of a meltdown himself.

it's the Adsense and Bourbon syndrome

Google techies come up with what seems a good idea, have no real comprehension of how it will affect websites in general, implement it, and we all start screaming.

Occasionally something happens to make Google scream too and realise it was a step too far - so the Adsense 'hijack' happens, GG claims it is a Bourbon update issue, they do a rollback, and the problem is fixed. Everyone else who would have fallen foul of that algo is reprieved, because Google hit themselves and realised they were hurting the innocent. Of course, they don't have the breadth of sites to suffer from most of their changes, so we all get to suffer for 'em :)

Google are making decisions based on how things should be in a perfect world - they only have to look at how imperfect their own setup is, though, to realise what real life is like.

What they should have, and I presume they don't, is some real websites selling real product, optimised and promoted in a real-life scenario. I suspect if they had those, they'd have a lot more understanding of how small changes affect the rest of us - which they currently only get when they happen to flick a switch that hugely affects their own PR10, naturally-linked-to sites.

Ok

So how many folks are busy today changing those 301 rewrites to 302's...

Not me, I'm happy with the

Not me, I'm happy with the way it is (now I've fixed it - thanks everyone!)

I am thinking about changing

I am thinking about changing a few, just for the heck of it.

I don't buy the idea that googleguy did not know Google was using a 302.

You can bet your last cent that all information googleguy releases is at the very least discussed among the other high-ranking googlites before he posts it.

As late as last week

As late as last week GoogleGuy posted that he had just learned how the reinclusion gig fully worked, even though he'd been dispensing that reinclusion URL and protocol for about a year:

http://www.webmasterworld.com/forum30/29720-2-10.htm

So, while I agree with lots0 that 99% of GG's communication is calculated, there's so much going on at the 'plex these days that even he is going to get blindsided once in a while.

I am keeping 301s but as far

I am keeping 301s, but as far as the tracking stuff goes, I have to find a way to not lose the code.

301s in httpd.conf

I should mention (yet one more) brilliant post by Ron Carnell over at HR regarding putting 301 redirects in the server httpd.conf file.
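Ron's post isn't reproduced here, but the usual httpd.conf pattern is a separate VirtualHost for the bare domain that does nothing except 301 to the www host. A generic sketch, with example.com and the DocumentRoot standing in for your own values (it assumes name-based virtual hosts are already enabled):

# a vhost for the bare domain that only redirects...
<VirtualHost *:80>
    ServerName example.com
    Redirect permanent / http://www.example.com/
</VirtualHost>

# ...and the real vhost serving the site
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
</VirtualHost>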

as braindead as 301/302/307

This plays out just as stupidly as 301/302, et al.

The way to tell whether the non-www and www are one and the same is to look at the DNS records.

The logic is as follows:

If a host (A) record exists for example.com and a CNAME record for www.example.com points to example.com, then the DNS zone under the control of the domain owner is stating explicitly that these are one and the same. There is no ambiguity here. Thus, it is not dupe content, but rather like getting to a building that happens to have two street addresses. The tax office is still on the tenth floor, and you still only want to pay the bill once!

The *only* niggle here would be virtual hosting, introduced in HTTP 1.1.

But really, same content coming from the same domain, from the same IP, marked as equivalent in DNS is dupe content? That is not a reasonable application of available knowledge - knowledge that Google can gain by structuring and interpreting their DNS queries properly. If the RFCs are not sufficient, there are tutorials and source code all over the internet they can find. Provided they avoid the scraper sites.

Just how much aggregate work and CPU consumption is Google causing globally with their failure to use information properly?

virtual hosting also uses dns

A and CNAME - the CNAME ("canonical name") being an alias for the A record.

Usually the A record is "domain.tld" and at least one CNAME exists, e.g. "www.domain.tld". However, there is no requirement for CNAMEs, afaik.

So, what the Google people internally speak of as being "the canonical" is not really any particular one of the possible C NAMEs. Rather, it's the A record (which is as hard to find as doing a lookup in the DNS table).

For the DNS-challenged I should add that such a lookup is not hard to do. At all.
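For instance, PHP can do the lookup in a few lines (a rough sketch; example.com stands in for the domain being checked):

<?php
// ask DNS whether www.example.com is merely a CNAME alias
// pointing at the bare domain's A record
$records = dns_get_record('www.example.com', DNS_CNAME);
foreach ($records as $r) {
    if ($r['target'] == 'example.com') {
        echo "www.example.com is an alias for example.com\n";
    }
}
?>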

The solution to the problem is so damn easy

>>>...google...structuring and interpreting their dns queries properly.
