Rumour - Google about to Kill Comment Spam

23 comments
Thread Title:
rel=nofollow
Thread Description:

On Friday we reported on a strange post at Dave Winer's site that said:

Last night I got an email from someone I've been wanting to hear from for a long time. There's a problem on the Internet, a big one, that only one entity can solve. The email outlined the solution and asked what I thought of it, and asked me not to say what it is publicly.

He went on to say that he had implemented the idea on one of his sites. Well, Simon Willison thinks he may have discovered what this is...

Google to Quash Comment Spam

Originally I had followed Todd at GeekCentral's surmise that the mystery email was from either Steve Jobs or Bill Gates - but Simon has spotted this on Dave's Bloggercon site. Check out the comments link and view the source!

<a href="http://someblogsite.com" rel="nofollow">

Which, if Simon is correct in saying:

Google are soon to announce that they won't be calculating PageRank for links with a rel="nofollow" attribute. Finally, an official way of fighting the economics of comment spam by denying PageRank on user-submitted link content.

would eventually have some effect on comment spam as a technique for rankings.
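
For illustration, here's roughly what the change might look like in a blog's comment template - the URL and surrounding markup are invented for the example:

<!-- before: a user-submitted link passes PageRank -->
<p class="comment">Great post! <a href="http://example-spammer.com/">pills</a></p>

<!-- after: the blog software adds rel="nofollow" to any link it didn't author -->
<p class="comment">Great post! <a href="http://example-spammer.com/" rel="nofollow">pills</a></p>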

If true, would it solve the comment spam issue?

I think not. There are many reasons why this would not work. In fact, we have talked about the available solutions a lot in recent months and I hold to my original point: You need to stop automated commenting - not disincentivize it!

Here's why:

  • The time it would take for this to proliferate to any great extent is HUGE
  • Even after a couple of years, there will still be tons of blogs that don't carry this tag for spammers to target
  • It's easy to check a page for that string before bothering to comment - this is good for bloggers, as up-to-date systems may suffer less spam, but there will still be lots of good targets
  • All of the above means that for a prolific spammer there is still great incentive to spam, and it's arguably easier not to even check for the tag and just live with the fact that some of your comments won't count.

However, it wouldn't be a bad start...

A cooperation between blog vendors and engines would be a reasonable start. Maybe in a year or two the problem would lessen to a manageable size.. I'm not sure, but it's a reasonable solution and certainly a better one than adding a tag to the HTML set as suggested by Danny. Sorry mate, I still think that's a lame idea :)

What do you think?

Comments

Cool

Would save resources for the good blogs where nobody cares about spamming. Any chance of having something like a "-rel:nofollow" operator for Google? :)

You missed my point

Nick, what I said was that an ignore tag might be one of many possible solutions the search engines might consider. Now, a nofollow attribute for a link itself isn't that much different from the idea I also raised in my post: that people might surround certain links with an ignore tag. So rather than ignore being lame, sounds like you're agreeing with me :)

But the bigger point is this. Publishers could use more ability to mark up their content for search engine purposes. We have little ability to do that now, nor have the search engines given us much over the years. It's long overdue for them to consider some type of options -- ignore, enhanced nofollow, whatever. I don't care or know the perfect solution. I do know they need to provide more to us, and that they should do it in a coordinated fashion. And should it turn out Google really is about to unveil some new attribute, it's actually bad. Why? Because what about Yahoo? What about MSN? That was in my original post as well -- we have had them all do a few unilateral things, when what we really need is consistency from all of them.

hmm.. was just following this rumour around...

Ruminate also posted this theory earlier. I wondered if the word "entity" may have meant something other than a person; note though that Dave also says he's only implemented it (whatever it is) on one of his sites.

I agree it would take some time to permeate, even supposing that is the answer to this mystery ;)

Also, is there not a JavaScript equivalent way of doing this anyway? Either way could be written into the popular blogging software, could it not, so even unattended blogs would be "defended"... (older ones would still be game though..)

just first impressions though..

If true, it would be like trying to undo screws with a sledgehammer

If true, it would be like trying to undo screws with a sledgehammer. You could JavaScript comments and that would do the same thing. Search engines creating tags for it is missing the point, IMO.
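
For those wondering, a minimal sketch of the JavaScript route - the element id and URL are invented, and the point is simply that a spider which doesn't execute scripts never sees an anchor element at all:

<span id="comment-link-1"></span>
<script type="text/javascript">
// Spiders don't run scripts, so this link never enters the index;
// human visitors with JS enabled see it as normal.
document.getElementById("comment-link-1").innerHTML =
  '<a href="http://example.com/">example.com</a>';
</script>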

And even if the commenting change were taken on board, so long as blogs that do not implement such a thing exist in numbers, comment spam will not be stopped at all - in fact, I'd argue it would invigorate blog spamming, as spammers use more aggressive techniques for finding the un-tagged blogs - even if that means hitting a wider number of tagged blogs. Imagine worms created specifically to search Google & other search engines on the scale that Santy did. Is that what the internet wants to create?

Ultimately, the problem isn't how search engines index - it's about publisher responsibility.

You know what - someone spammed my Platinax News blog with 100 trackbacks to porn sites earlier this week. Now get this - I didn't whine about the blog software not being secure enough, and I didn't whinge about search engines not giving me noindex tags - I simply deleted the whole disgusting mess, and made a note of the Russian kid's name for future reference.

That's publisher responsibility.

That's tackling the issue, but you don't get rid of the problem until publishers in general take full responsibility for their publishing, and software developers aid the publishers by allowing extra precautions against the spam - allowing comments from registered users only is a simple start to protect any blog.

Such a tag, or any tag for that matter, won't stop comment spam

Such a tag, or any tag for that matter, won't stop comment spam - it will just make some of the links not work. Honestly, if you comment spam 50,000 sites a day you don't care which ones count and which ones don't, as long as the percentage of links that work is high enough. It's like any other spam in that sense - it's a numbers game.

So, please go ahead and add that tag to your (anyone's) blog if you have spam problems and see how much it helps :) Personally I won't bother to check it before I link-spam you

Ignore tags

Quote:
So rather than ignore being lame, sounds like you're agreeing with me :)

Not really, but I am ever so slightly closer :) Danny, you can't just go adding tags to the HTML specs - remember the <blink> tag and other such nonsense? The specs for XHTML are very good now, and are being widely adopted by blog and CMS vendors, which in itself is a major step away from the HTML 3.2 that people like Tabke and Teigtmeier insist is good for search engines (refuted by GoogleGuy at SEW) - adding additional markup that's not in the spec goes against all of HTML's intended use and would eventually cause problems.

That's a debate for a design board though :) However, if someone can be bothered to trawl through w3c.org and prove me wrong, please do, but I'm pretty sure you can add a rel="nofollow" or anything else to other tags like <div>, so you could implement your idea within the existing framework of the XHTML specs.

Quote:
I do know they need to provide more to us, and that they should do it in a coordinated fashion. And should it turn out Google really is about to unveil some new attribute, it's actually bad. Why? Because what about Yahoo? What about MSN? That was in my original post as well -- we have had them all do a few unilateral things, when what we really need is consistency from all of them.

I don't think that the search engines should provide us with anything in that manner, and I don't think we need it either. It's the search engines' job to ensure that their indexing works and that they provide relevant results to users. The provision of arbitrary markup in order to index better would create a massive (even more massive) gap between those in the know and the far greater majority who are not.

The engines need to work out ways to distinguish ads and undesirable links by algorithm alone. They're already making headway with this with block-level analysis, right? Soon, if not now (to a certain extent), links in sidebars, footers or headers would be deemed less important as they are not part of the real meat of the page. Correct me again if I'm wrong, but that's my understanding of it.

As far as the rel="nofollow" goes - to a certain extent I'd be forced to follow my own argument above and throw this idea out as well; however, in the context of blog spam it's a little different: Firstly, it would be put in place by blog software vendors, so eventually it would proliferate to a wide extent and have some effect. This means that ordinary joe blogger would not have to be involved, good stuff. The comment spam problem, while we might joke a bit about it in SEO circles, is massive, and causing a lot of grief for an ever-increasing group of people on the web, so this might mean I have to bend a little and support such a move. That's okay though, I've never promised to be consistent :-)

As for unilateral agreement between the SEs - that's a moot point really, I think. If MT picked it up and initiated it, followed by either of the big SEs, the others would be fools not to follow, but I do agree that it would be best implemented with full agreement from all major parties - that includes the SEs and the software guys.

So, we're a little closer Danny, it's just the details that we disagree on...

The real issue

As Brian and Mikkel pointed out above, not even this is enough.

The only real way, and I've said this all along, is to stop automated commenting! There are a number of ways to do this, not least of which are the standard introduction of CAPTCHAs and changing the way that comments are presented in the first place - and yeah Suzy, JS is one option...

For any that missed it, we had a good discussion on this here and followed up here

Yep

Blog spamming is a problem where you allow comments. We've implemented a few fairly successful methods and are moderating comments now, which doesn't seem to stop the pill-pushers (I must say these people are not only boring but tacky). We'll probably introduce captchas (those enter-the-numbers things) in the future; I also saw a *great* technique recently for stopping such spam.

CAPTCHAs are broken

CAPTCHAs are already solved by the good blog spammers.

Yeah

But that's just splitting hairs, Thomas - you will never, ever rid the world of spam when comments are allowed - raising the bar so that it's not so ridiculously easy, though, is a good start..

Well, have to disagree with you

Well, have to disagree with you, Nick. The search engines do indeed need to provide publishers with some better ways to have content indexed. Got a problem with robots.txt? Probably not -- but that's an example (one of the few) where search engines responded to publishers' concerns and gave them an option. The meta robots tag? Same thing -- and extended by some of the search engines to respond to publisher concerns about photo indexing.

I'm not saying that search engines should rely totally on tags. They can't. People misuse them deliberately and accidentally. They'll still have to do automated analysis.

But the idea that the search engines should just sit back and wait until some HTML specs are extended? Sorry, the W3C has been absurdly woeful in helping there. What specs we have either came out of the search engines themselves getting together back in mid-1995, or there are these proposals that the search engines themselves were never involved with -- so they don't adopt or use them.

I don't care if it's tags, extension of XML, HTML -- whatever. There are publisher needs that the search engines are not dealing with, and it would be nice if they could finally come together to perhaps advance the state of indexing in a coordinated manner -- not just to help the blogging side of the web, but the entire web. Publishing issues go well beyond comment spam :)

Correction

Quote:
But the idea that the search engines should just sit back and wait until some HTML specs are extended? Sorry, the W3C has been absurdly woeful in helping there.

I didn't suggest you should wait for the W3C - as far as I'm aware, the whole point of HTML is to be device-agnostic - and that includes search engine spiders.

The meta tag uses a pre-existing HTML element for a specific purpose - it's what the meta tag was invented for: adding metadata. This is/was the search engines' responsibility (perhaps suggested by users of course) and was implemented in a standards-compliant way. Good call as far as I'm concerned.
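
For reference, that's this tag - a standard meta element carrying the engines' own name/content convention:

<!-- the "robots" name and its values are the engines' convention,
     riding on the spec's generic meta element -->
<meta name="robots" content="noindex,nofollow">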

The W3C should not be helping here. However, there are ways to use the existing, device-agnostic markup to do this - i.e. using the rel attribute.

Just a correction to your point Danny; we are actually thinking along quite similar lines, but it does appear that, like many SEOs, your knowledge of such mundane things as markup is lacking heh..

Nick, I think you have to look at things in a broader view

Nick, I think you have to look at things in a broader view than the sites you operate, or are focused on. My first comment was mostly in regard to the headline of this thread and I stand by that. I don't think such a tag will kill much comment spam.

However, I must agree with Danny that there are many publishing needs on the web and many corporate wishes, local laws, standards etc to take into account. And Danny is right, the degree of control publishers have today is not good enough. If you want to exclude parts of a page from indexing today you can do so - just cloak it - but then you will be violating the engines' guidelines and risk a penalty. That's stupid if all you want to do is secure an indexing that fits your needs, local law or whatever.

See my earlier posts, Mikkel

I'm not saying it can't be done; I'm not a fan of it but have no particular objection - it should just be done in a way compliant with XHTML standards so as not to interfere with the many other considerations outside of indexing. Using a rel attribute on elements achieves exactly the same purpose as a specific tag - it just fits with the standards as opposed to breaking them :)

Correction

Hmmmm.. I took the time to look it up now - you actually couldn't use the rel attribute on anything but an anchor (or the <link> element) - the best way other than that would be to use a CSS class.
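
A sketch of the class-based alternative - the class name, and the idea that any engine would honour it, are both hypothetical here:

<!-- hypothetical: an engine would have to agree to discount links
     inside any element carrying this class -->
<div class="ignore-links">
<a href="http://example.com/">a link the engines would skip</a>
</div>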

The simple fact remains though: assigning arbitrary tags to HTML without going through the W3C is not helpful at all.

I agree, Nick, that of course

I agree, Nick, that of course it should be, has to(!) be, implemented in such a way that it does not interfere with any other standards. Naturally!

I think the term "tag" is just used in this discussion to illustrate the need for some publisher-controlled code that can be used to adjust indexing. I do agree we need some better standards here. Let's face it, robots.txt and META robots tags are just not good enough!

Personally I don't care much about the little technical details. I don't care if it's implemented as part of robots.txt, as rel properties (if allowed), div-tag properties or CSS - just as long as publishers can better communicate indexing needs and limitations to engines in a standardized way across all major engines.

My point is just that this doesn't relate at all to the headline of this thread :)

Right

We have drifted far away from the original topic, thanks Danny! grrrr... hehe...

However, I'm not too worried about that. Are you? :) I like to take a tangent, and this one is about as close as you can get to the original subject whilst totally going off down a different path heh..

What about noindex Abuse?

Wouldn't we see a whole heap of new trouble from having a way to keep search engines out of certain areas?

I can hear the cries now: "Spammer! He's got off-theme content in his noindex section!" - "Spammer! Hey, this guy is selling stuff and not letting the search engines see it!" ---- more whining to come :(

Whining has always been around

Whining has always been around but has never had a major impact on the development of search engines, spam filtering or standards. Nor should it :)

Compliance

I know my knowledge of HTML and markup is apparently lacking, especially when the documents and resources about them use all those darn big words that can be hard to understand. But...

The rel attribute for links appears to have been introduced as part of the HTML 3.2 spec, http://www.w3.org/TR/REC-html32.html with some proposed values.

HTML 4 defines some "recognized" link types, http://www.w3.org/TR/REC-html40/types.html#type-links, though by whom these are recognized, I don't know. I'd assume the W3C in its infinite wisdom is recognizing them. The XHTML 2 spec seems similar, FYI: http://www.w3.org/TR/2004/WD-xhtml2-20040722/mod-metaAttributes.html#adef_metaAttributes_rel

I have a laugh at the specs because of how they say things like "search engines..may interpret these link types in a variety of ways." The majors haven't been doing anything with them at all, to my knowledge. They probably weren't even consulted when the link rel specs were created, given the W3C's track record.

I mean, I always enjoy things like this document about links and search engines: http://www.w3.org/TR/REC-html40/struct/links.html#h-12.3.3

It tells us how search engines will use links to find things like your documents in alternate languages. You remember the major search engines all talking about how they support that type of activity, right?

Oh, but wait, maybe the tips on how search engines will index your site are more helpful, http://www.w3.org/TR/REC-html40/appendix/notes.html#recs.

Um, some of it is -- but stuff like the language support isn't, and has, like W3C stuff in the past with search engines that I've researched, come out of the blue from some wishful thinking by those creating the guidelines rather than from actually talking to or working with any of the major search engines.

Having said this, the specs say that document authors can expand the list to use any rel values they want for links. If they do this, they are supposed to provide a profile of this info in their header area. Whether each author has to use their own profile or instead reference some profile somewhere is unclear.

So...if Google does this...technically, if those authors using the attribute linked to profile info about it that Google also provides, it would meet the HTML 4 standards and effectively be within the W3C's recommendations just fine.
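
To make that concrete, the declaration would hang off HTML 4's profile attribute on the head element - the Google URL below is invented, since no such profile has been published:

<!-- hypothetical profile URL describing the custom rel value -->
<head profile="http://www.google.com/profiles/nofollow">
<title>My blog</title>
</head>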

And if Google does it, then the other search engines might follow. Or not. Authors will use it if their tools support it. For the bloggers, no doubt Google support will get added -- another reason the other search engines will likely trail behind and do the same. Put it in FrontPage, and non-blogging web authors (there are a few of those) will do the same.

By the way, meta tags are not a search engine's responsibility. Meta tags were introduced as a way for all types of information about a document to be embedded with a document, if we're talking meta name tags (http-equiv types of meta tags stand in for server-based data that might otherwise be reported).

Anyone can create a meta tag value -- the W3C did not, nor still does not to my knowledge, approve any particular values. Dublin Core has a ton it set up and recommends. We have some search-specific ones that the search engines came together on (sort of) nearly 10 years ago, as I mentioned in my original post.
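
Dublin Core's HTML convention illustrates that nicely - from memory, it looks something like this, with author-defined meta names and no W3C approval required:

<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.title" content="Google to Quash Comment Spam">
<meta name="DC.creator" content="Threadwatch.org">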

Now if you fly ahead to XML, the point of that was that anyone -- anyone -- could define a set of tags that could be supported in any way they want. You'll have to correct me if I'm wrong, because of course all those big design words tend to scare little ole me.

But take an RSS feed built in XML. No one ran to the W3C and asked if they needed to approve some of the tagging that is done within it. Instead, you get a group of people to agree on a format, and bang, use the tags (RSS 1.0). Or don't agree with that and do your own thing, if you're another group (RSS 2.0). Or don't agree with that and come up with a new name (Atom).
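
A sketch of that point - the namespace and element below are invented, but this is all extending RSS 2.0 takes, with no W3C sign-off involved:

<rss version="2.0" xmlns:search="http://example.com/search-extensions">
<channel>
<item>
<title>Some post</title>
<!-- hypothetical search-engine extension element -->
<search:noindex>true</search:noindex>
</item>
</channel>
</rss>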

So the search engines, if they want to introduce an XML tag that could be incorporated within a document would be well within any W3C compliance, as far as I can see.

Wow Danny...

That's a lot of effort to go to..

You appear to be missing what I'm saying though. Let me try and answer some of the more important bits one by one - then I'm off to bed :)

Quote:
I know my knowledge of HTML and markup is apparently lacking, especially when the documents and resources about them use all those darn big words that can be hard to understand. But..

Steady on tiger...

Quote:
I have a laugh at the specs because of how they say things like "search engines..may interpret these link types in a variety of ways." The majors haven't been doing anything with them at all, to my knowledge. They probably weren't even consulted when the link rel specs were created, given the W3C's track record.

Why on earth would the search engines be consulted? Search engines are not what HTML is, or should be, designed for...?

Quote:
By the way, meta tags are not a search engine's responsibility. Meta tags were introduced as a way for all types of information about a document to be embedded with a document, if we're talking meta name tags (http-equiv types of meta tags stand in for server based data that might otherwise be reported).

Well, my apologies for being unclear - I did not intend for it to be read as though I believed search engines invented meta tags heh... they did invent the "robots" bit though, right? That was my point, see below. (The word "robots" should have been the second word, my bad..)

Quote:
The meta tag uses a pre-existing HTML element for a specific purpose - it's what the meta tag was invented for: adding metadata. This is/was the search engines' responsibility (perhaps suggested by users of course) and was implemented in a standards-compliant way. Good call as far as I'm concerned.
Quote:
So the search engines, if they want to introduce an XML tag that could be incorporated within a document would be well within any W3C compliance, as far as I can see.

Not to my understanding, no. They can extend markup all they like, but then it will cease to be XHTML 1.0 (or any of the other standards) compliant and it will be its own entity.

I have it on good authority that another member has some specific responses for you on HTML/XHTML and the W3C - I'm so relieved I may just have to go make a cup of tea to celebrate hehe...

Why Consult?

Why consult the search engines? Well, the W3C does things like:

1) Gives people some attributes to use such as language or alternative media types

2) Tells people to use these things so that search engines will index them in a certain way

3) Then fails to acknowledge that the major search engines may not (and actually don't) act the way the W3C says they will.

I've long, long written that search engines were like the third browser, from back when designers used to focus on IE and Netscape. Search engines, of course, view documents differently from those browsers.

Now, the W3C is certainly going to try and include someone like Microsoft when it drafts standards on how to create web documents. Microsoft may not follow those standards -- but gosh, they do make some attempts to reach out and consult.

The search engines have not been involved in any such process that I have heard about for the longest, longest time. Why wouldn't you want to involve them? I mean, the W3C has material it already offers to help web authors get indexed. Might be nice to actually provide tips that the search engines themselves have helped craft.

I'm obviously giving you grief above for ribbing me on my coding knowledge. Nope, I'm not a coder. I don't live and breathe HTML/XML and look forward to what your other member is going to post. But the idea that the search engines should be involved in whatever page language we're talking about doesn't require coding knowledge. They index billions of these documents. You want them to be involved at the core -- they should be -- because it offers a lot of advantages to us if they would be. We might have more control over how we are indexed. And yep -- yep -- yep -- that control can be abused. So what. Things are already abused. Personally, I'm tired that because of fear, we don't get more control.

Members

A reasonable point Danny - perhaps the search engines should cough up the $57,000 annual fee like M$ and others to become full members?

This way they could at least have a say...

Hell, even the smaller SEs could get in as affiliate members at $5,700 and pick up some nice PR9s for their trouble hehe...

they're only recommendations...

Well this has deviated a bit .. and all from a rumour too :)

To address all those W3C comments, let me say that the W3C provides recommendations, not specifications, which is picky, yes, but the two words provoke very different ideas. It is also why they say "may" a lot; there is a lot of suggestion and flexibility in there.

Things have matured quite a lot since the beginning, and in the absence of any other guidance at the time, I would say they have done very well keeping up to date, and even the Metadata and Resource Description activity (which covers PICS indexing data) has been superseded by the Semantic Web Activity..

The working bodies, comprised of corporate entities, enthusiasts and volunteers alike, have together helped shape the recommendations so far, and up to now have managed to evolve with the Web and beyond the basic bickerings of "we weren't consulted in the first place"..

Google, for example, didn't exist back in the beginning - how could they have known what they were going to need? Who knew what Search was going to evolve into? Even TBL says:

Quote:
The most exciting thing about the Semantic Web is not what we can imagine doing with it, but what we can't yet imagine it will do. Just as global indexes, and Google's algorithms were not dreamed of in the early Web days, we cannot imagine now all the new research challenges and exciting product areas which will appear once there is a Web of data to explore.

.. it certainly doesn't make it any easier for the Search Industry at this point, as I think this is one of those "research challenges" that every industry involved in the Web faces, and in all industries we are having to deal with legacy/outdated/bad ideas, code and practices. Only time and perseverance will solve that.

If they were to adopt something within RDF, then it should enable their business model to continue to evolve, enable them to remain at the forefront of their industry, enable them to have a steering influence, and also enable authors to implement it more widely. Authors will likely not be very trustful in implementing any guidelines/recommendations from one single source (i.e. the Search Industry or Google alone), as they have learnt in the Web's relatively short life that things change very quickly and no one company (not even Microsoft) should have that sort of control; I think they would be happier to stick with established bodies (W3C or Dublin Core). So they need to work together.

Quote:
But the idea that the search engines should be involved in whatever page language we're talking about doesn't require coding knowledge.

Yes, they do need to be involved in the language, and I disagree - they do require coding knowledge, else how are they going to know what bits can be abused or not (they should've gained that knowledge for free by now anyway!)? Or maybe they do want to land up in another "let's just step on everyone else's toes until someone who does know how to manipulate code comes along and breaks our toy" scenario?

That language should not be (X)HTML; that is (or should be) in the past for them and is way too small-minded ....

They (or their spiders anyway) need to be involved in the machine-understandable metadata languages, because after all, what they're trying to do is use machines to harvest information from links..

At the minute they are doing this by trying to use simple HTML anchor links supplemented by algorithms, but as we can all see, the anchor element is just the latest in a long line of HTML elements which have been abused because of them. If the Search Industry remains focused on (X)HTML .. it will likely be their loss and will only lead to more disgruntled behaviour from publishers, designers and SEOs alike. (Spammers, though, would likely remain the happiest.. )

If they were to focus on the metadata languages, it would also take care of any so-called "responsibility" to publishers, because publishers will know exactly where they stand too - don't they have a responsibility as well? They will either learn to implement the language, buy something or employ someone to do it (a new breed of SEO perhaps?) and then get on with their own industry/part of the Web.

It needs (and has probably already had, actually..) much more thought than just using XHTML/CSS attributes. Designers are getting just as annoyed as other publishers that some groups of people, along with the SEs, are spoiling their industry by making them unable to legitimately use some of their language's established attributes for fear of getting caught in the mess that is markup abuse..

Meantime of course they and we will have to live with their legacy, so they will likely have a lot of backlash to deal with at some point in the future (Microsoft springs to mind here). Before they can expect some respect for future propositions or actions they need to take responsibility for their own mess and show some respect by communicating with Publishers.

They will have to continue using XHTML for a little while obviously, though I think it's a very honorable step that they would choose to use an existing attribute, rather than creating a new element, to take that first step. It would be easy to implement and doesn't interfere with existing pages.

Although it may not work in the context it's being linked to in the rumour, I would take it as a sign that a first step is being taken.. but will they say that?.. and if so, will folks accept it as such or is it just going to degenerate into another slagging match?

Also, if Google/Yahoo/MSN is responsible for this first step, is the rest of the Industry *big* enough to sit down around a table and build on this without sulking that they weren't first? They all obviously need time to keep the noise down until they and the publishers can catch up with each other in order to sort out/compromise on ALL aspects of document code.

I really do hope this is a first step towards the Industry as a whole incorporating other metadata.. and letting the Web and all its data evolve beyond comment spam.

Just for info..

Quote:
HTML 4 defines some "recognized" link types...though by whom these are recognized, I don't know. I'd assume the W3C in its infinite wisdom is recognizing them.

There are a lot of applications recognizing them - not enough perhaps, but still it's growing - not least of which are accessibility software and text browsers; Opera renders them, and you can get a FF extension to use them (I've got a nice little toolbar at the bottom of my page which means I don't need your on-page site nav.).. anyhow, theoretically the SEs could use them too if they wanted to.. maybe they are already, to navigate sites more easily. It's sure gotta be easier than wading through all that HTML junk.. ;)
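
For anyone who hasn't seen them, they're just link elements in the document head - this is what Opera's navigation bar and the FF extension read (the filenames here are invented):

<link rel="contents" href="toc.html">
<link rel="prev" href="part1.html">
<link rel="next" href="part3.html">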

anyhow I digress and the topic is now way off.. (sorry Nick),
I think my point is the Search Industry does not need to create anything new just because what they've been using doesn't work very well any more (or has been abused to death ~ get over HTML 3.2/4). Just take a look at what else is already there; there may be no need at all to re-invent. Sure, if after you've looked and discussed you don't think it will scale, then you can make it better if you want (get spokespersons on those working bodies!).. but LEARN IT inside out and use it well, don't let it get outwith your business model's control like HTML obviously did, and don't let me read again in another 10 years' time that you weren't consulted.. get out there and make it your business!
