The Solution to Blog Spamming

16 comments
Thread Title:
Comment Spammers Have Blogs of Their Own
Thread Description:

In the threadlink above, Jeremy Zawodny of Yahoo is talking about solutions to the ever-increasing blog spam problem. Recently SixApart, makers of MovableType, have been experiencing server load problems due to the voracious appetite of spambots employed by hardcore search marketers (spammers) in an effort to get top rankings in competitive areas.

Jeremy won't link to the spammer's site; maybe he means DaveN's site?

Jeremy's solution is this: Assuming that 80% of bloggers use the same major blog software and that 80% don't change the default templates, just have search engine spiders look at the code and differentiate between the original post and the comments. Don't count comment links at all.

Why that isn't a great idea...

I think Jeremy's solution is a poor one for a few reasons:

  • It will kill a good many great links - Comments are used for discussion of the original post and, as here on Threadwatch, the discussion that follows often produces some outstanding links to great resources that the original poster never knew about. I'd hate to see those sites not get the full benefit of a link from us.
  • Computational overhead - I'm no search engineer, but I'm reasonably certain that comparing code on pages to look for MT (or other blog software) footprints and then weeding out the comment links would require a fair bit of extra computation, and this may not be doable from an SE standpoint.
  • I'm not convinced that the search engines should be responsible for finding a solution - I'm not saying that it's SixApart's fault, just that they are in a better position to find a solution to this.

So, What's on the Table?

I think the solution lies with the software producers, and the company that comes up with the best solution and can produce figures to prove that it works will have an excellent selling point for their product. As it stands, MT's MT-Blacklist is crap: it's a constant "bucket and bail" effort that's reactive rather than proactive and falls way short of being labeled a "solution". Other blogs, such as bBlog, have implemented Captchas - where you have to enter the digits shown in a graphic to comment - this is better, but it's not unbreakable.

Here are a few of the current solutions being used or suggested:

  • Captchas - Enter the digits in an image to comment
  • Registration - To comment, you must sign up
  • Pre-moderation - The blog owner approves or disapproves each comment before publication
  • Blacklists - Lists of known spammers by site, IP or keyword regular expressions
  • Search engine filtering - As described by Jeremy in the threadlink above
  • Non-PageRank-passing links - Links from comments go through a non-spiderable jump script
  • Bayesian filters - Looking at the comments and determining if they are spammy or not

Let's have a look at each:

Captcha: This to me seems like a good solution, or part of a good solution. Captchas can be broken, but it's far from easy to do, so the bar for comment spamming would be set very high.
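
To make the idea concrete, here's a rough Python sketch of the server side of a digit captcha: it picks the digits and later checks what the commenter typed. Drawing the digits into a noisy image is left out. The names `SECRET_KEY`, `new_challenge` and `check_answer` are made up for illustration and aren't taken from any blog package.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real install would load it from config.
SECRET_KEY = secrets.token_bytes(32)


def new_challenge(length=5):
    """Return (digits, token).

    The digits get drawn into the image shown to the commenter;
    the token goes into a hidden field of the comment form.
    """
    digits = "".join(secrets.choice("0123456789") for _ in range(length))
    token = hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).hexdigest()
    return digits, token


def check_answer(answer, token):
    """True if the typed answer matches the digits the token was issued for."""
    expected = hmac.new(
        SECRET_KEY, answer.strip().encode(), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison, so timing does not leak partial matches.
    return hmac.compare_digest(expected, token)
```

The token travels with the form so the server doesn't have to remember each challenge. A real deployment would also want to expire tokens and refuse reuse, otherwise one human-solved challenge could be replayed by a bot.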

Registration: Provided registration requires an email verification, this would undoubtedly be a good solution, or part of one. Again, it can be broken, but not easily. The major problem with this is that many bloggers feel that the quick and easy way of commenting on blogs is part of their appeal, and this solution presents an obvious barrier to participation. I'm of the opinion that there are too many bloggers who have a rose-tinted view of the internet and how it works - they should wake up to reality and realize that it has never been, nor ever will be, an ideal world.

Pre-moderation: This sucks. It's worse than useless unless every blog out there (or at least 80%) has it built in as default. The server load would not slacken for some considerable time and the burden of blog spam would simply be moved to a new area, i.e. rather than removing spam, bloggers would spend their time pre-moderating spam. Not a viable solution in my opinion.

Search Engine Filtering: Well, as I said at the beginning of this post, I really don't feel that's the right way to go. It would kill a lot of good links and stop them from passing well-deserved algorithmic recognition to worthy sites being discussed in context. Sorry Jeremy, that sucks :)

Non PageRank Passing JumpLinks: This sucks for the same reason given for search engine filtering above.

Bayesian Filters: It's my understanding that these are piss easy to break. Please correct me if I'm wrong.
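
For anyone who hasn't seen one, this is roughly what a Bayesian comment filter boils down to. It's a bare-bones naive Bayes sketch in Python with made-up training comments, not the code of any real filter; `train` and `spam_score` are names invented here, and it assumes you have at least one spam and one genuine example to learn from.

```python
import math
from collections import Counter


def train(comments):
    """comments: iterable of (text, is_spam) pairs.

    Returns per-class word counts and per-class comment counts.
    """
    words = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in comments:
        words[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return words, totals


def spam_score(text, words, totals):
    """Log-odds that text is spam; above zero means 'more likely spam'."""
    vocab_size = len(set(words[True]) | set(words[False]))
    spam_total = sum(words[True].values())
    ham_total = sum(words[False].values())
    score = math.log(totals[True] / totals[False])
    for w in text.lower().split():
        # Add-one smoothing so an unseen word does not zero out a class.
        p_spam = (words[True][w] + 1) / (spam_total + vocab_size)
        p_ham = (words[False][w] + 1) / (ham_total + vocab_size)
        score += math.log(p_spam / p_ham)
    return score


# Made-up training data, purely for illustration.
training = [
    ("cheap poker online", True),
    ("online poker bonus", True),
    ("great post thanks", False),
    ("interesting link thanks", False),
]
words, totals = train(training)
```

With this toy data, `spam_score("poker bonus", words, totals)` comes out above zero, while padding the spam with words the filter has only seen in genuine comments, as in "poker thanks great post", pulls the score below zero. That is one way filters like this can be gamed.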

So, What is the Solution to Comment Spam?

To solve the blog spam problem you first have to understand why blog spam occurs. It amazes me that the majority of bloggers who complain loudly about spam seem to have no idea why they're being targeted. I wrote yesterday on the subject of why bloggers think they invented the internet and the points in that post seem to apply here as well. Bloggers need to take a look at the world outside of the blogosphere.

Understand your Enemy
The blog spammers I know have an intimate understanding of all the major blog software, the blog boom itself and how it can be used to further their own sites' advantage in the search engines. Bloggers need to take this approach also.

Let me help you get up to speed...
It's really rather simple: a link to a website with the right anchor text is valuable for search placement. Simple as that. By allowing users to comment and having their "name" as the link text that goes to their website, you are enabling them to sign in as "v|agra guy" or whatever - the keyword in the link text is what is important.

Contrary to popular belief, it doesn't really matter if your blog has a high PageRank or not; the anchor text is what is important.

With me?

Now you're up to speed, let's look at what I think is a good solution and ask the good boys and girls at Threadwatch to dissect it, poo poo it, agree with it or present better solutions:

And the Winner is....

It's not an ideal world. For those bloggers reading this who think that if they wish hard enough people will stop spamming their blogs, or that whining about it is a solution as opposed to altering the way comments work: wake up! It's not a pink fluffy internet out there, and if you want to do something permanent about this then you're going to have to change a few things, okay?

  • Change the way comments work - Instead of having the "name" part of the comment form as the anchor text that links straight back to the commenter's website, have the link go through a non-spiderable jump script. (Yes, I know this goes against what I said earlier, but bear with me...) This will allow users to click the link and go to the commenter's site but not allow any benefit to spammers.
  • Allow HTML/BBcode in your actual comments - Yep, you heard it right. This will allow users to link to on-topic material and add value to the post - yes, it would allow spammers to just insert HTML into their comments, but again, bear with me...
  • Use a Captcha system for those that comment OR require registration. Arguably the first option is best as it presents the smallest barrier to participation. This will stop the vast majority of automated spam - period. And Enable it by Default!
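
Here's a rough sketch of the first point, in Python. The `/jump` path and the function names are assumptions for illustration, and keeping spiders out of the jump script (for instance with a robots.txt Disallow rule) is assumed to be handled separately.

```python
from html import escape
from urllib.parse import quote, unquote

JUMP_PATH = "/jump"  # hypothetical endpoint, assumed to be off-limits to spiders


def commenter_link(name, site_url):
    """Render the commenter's name as a link that goes via the jump script.

    The page itself never contains a direct href to the commenter's site.
    """
    href = f"{JUMP_PATH}?to={quote(site_url, safe='')}"
    return f'<a href="{escape(href)}">{escape(name)}</a>'


def jump_target(query_value):
    """Decode the 'to' parameter; only plain http(s) destinations are allowed."""
    url = unquote(query_value)
    if url.startswith(("http://", "https://")):
        return url
    return None
```

Only http and https targets are accepted. Even so, a script that redirects anywhere is itself open to abuse, so a real one would probably only redirect to URLs it has on record for a comment.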

There, that wasn't so hard now, was it?

What Problems will this Solution Present?

Not too many, I think. The main issue is getting bloggers over the idea that it's an ideal world out there. You WILL have to change some stuff if you want this to work, just accept it. You will also get very clever bots that can break your captchas, but having spoken to some tech guys about this, I gather that would be very minimal.

The worst thing would be that you would have to police your comments. You will still get people submitting comments by hand with links to their website embedded in the comment field. Some will be clever, and it will be hard to tell a spam attempt from a genuine comment; well, that goes with the territory, I'm afraid. Forum owners have to deal with promo posts on a daily basis, I have to here at Threadwatch and so will you - you just can't do this without a little work.

In Conclusion

Until search engines work differently, i.e. they don't place as much importance on link text or they find a way to tell spam from genuine, the idea of auto-spam links will not go away. There is already talk of Wiki Spam, and blog software developers will need to look hard at their systems with regard to where else spammers might find a way to put live links in. As I've said, this won't go away without work.

Blog Software is Where the Ultimate Responsibility Lies
It will be down to companies like SixApart to initiate changes in their software to thwart spamming - they're already working on it. Solving the problem for you personally by hacking your blog script won't help much on the whole - neither will plugins or tutorials etc. - this stuff needs to be factored into the core scripts and Enabled by Default.

Go on, Tear it Apart...

OK, so pull my post to pieces please. Let's have some thoughts on my suggestions, some pointing out of anything I've missed or why the whole thing is a big bag of shite....

Small Disclaimer
It was pointed out to me that this might be seen as an attack/dig at Jeremy personally - it's not. I don't think his solution was a good one and that's where it ends :) The remarks about bloggers "waking up" etc. were aimed at bloggers in general, not at any specific individual - thanks...

Added: John Battelle has just written about Jeremy's post also, worth a look...

Comments

So basically...

...you mean that all blogs should be like TW (which in fact is more like a blorum, but I won't rake that up...*)?

Personally I'd JS the links, so people interested could click on them, but the value for PR would be null. Why want PR from a quick and easy blog post? I agree you should allow people to check out your stuff (i.e. with a link).

That said: your solution has a lot going for it. It would turn every blog into a blorum, making more money for the owner of blorum.com - hehe. * Oops.

Maybe you've already said it

Maybe you've already said this and I missed it 'cause my children are already hyper about xmas and are twittering away in the background making it difficult to concentrate, but one of those little images with letters on which must be input by a real person would be the simplest option to implement against the bots.

Ah, you have

Ah, you have ... captcha.

Geeze, don't teach your baby to speak; you'll save yourself a lot of trouble and backchat!

Personally I think captcha is

Personally I think captcha is the best solution. Nobody likes registration, and manual spam isn't so much a problem.

bBlog 0.9 (currently in beta) supports out of the box:
captcha, pre-moderation, turning off comments on a post after a set time and pre-moderation only for posts with links in them.

With the captcha I haven't had a spam in 4 months, after getting a few a day without it.

Side note: the new version also supports custom URLs out of the box, e.g. /blog/my-keywords.html helps with ranking for such competitive terms as 'threadwatch' ;)

hehe...

yeah yeah...

I like the captcha solution as well; it does seem the sensible course, but I do think it should be coupled with some changes to how the whole comments system works, as laid out in my first post...

I love CAPTCHA

It will definitely stop the majority of spammers from doing their work but will allow those that have developed a way of *using it to their advantage* free rein. If used sensibly the overall spam will be low enough to manage yet great enough to rank, allowing those few to dominate in a simple manner.

As to the other protection methods, I feel that usability restrictions will stop them being employed.

Vive le CAPTCHA I say :D

The blogger should be responsible for comment spam

I do not like Jeremy's solution either and it does not seem like a dig. Comment spam should be the responsibility of the person who set up the blog, imo. I know that a good many bloggers may not have the tech background/interest/desire to manipulate blog switches or templates. So a little more education on how to stop comment spam for each platform would probably go a long way to helping slow the problem.

I use a blacklist made up of IPs, URLs, and various regular expressions for popular spam phrases like 'online poker'; it's called WPBlackList2.8, for WordPress. It deletes comments automatically and can harvest comment spam info from each comment it deletes. Since installing WPBlackList I have not had to manage, i.e. delete, a list of comment spam.
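
A blacklist check of this kind can be sketched in a few lines of Python. This is not WPBlackList's code, just an illustration of the idea; the IP, domain and pattern below are made-up entries and `is_blacklisted` is a hypothetical name.

```python
import re
from urllib.parse import urlparse

# Hypothetical entries, made up for illustration.
BLOCKED_IPS = {"203.0.113.7"}
BLOCKED_DOMAINS = {"spam.example"}
BLOCKED_PATTERNS = [re.compile(r"online\s+poker", re.IGNORECASE)]

# Rough pattern for pulling URLs out of a comment body.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def is_blacklisted(ip, body):
    """True if the poster's IP, a linked domain, or the text is on a list."""
    if ip in BLOCKED_IPS:
        return True
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname or ""
        # Match the listed domain itself and any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
            return True
    return any(p.search(body) for p in BLOCKED_PATTERNS)
```

For example, `is_blacklisted("198.51.100.1", "see http://www.spam.example/page")` is caught by the domain list, while an ordinary "Nice post, thanks" from the same address gets through.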

Accessibility

My concern with captchas is their negative effect on visibility. To work effectively they rely on being noisy images, which poses immense problems for those who are partially sighted and using a regular web browser, let alone those browsing with a screen reader.

Certainly they work well (though if they are universally adopted, how long will it be before image detection software gets round them?) but at the same time they undermine what is for me one of the key principles of web-based communication: accessibility.

On Accessibility

My concern with captchas is their negative effect on visibility... but...they undermine... accessibility.

One possibility is to have a means to insert an audio tag, and have the audio be a constructed sound. MP3s can be built simply by pasting them one onto the next one, so one could simply record the digits 0 to 9, and the letters A to Z, and create an MP3 consisting of the letters to be typed. Then just paste the files containing the sounds together and deliver one new created file containing the whole word as a new sound similar to the way GIF images are created from scratch. (If you don't want to use MP3s you could do OGG or WAV files).

The only real problem is supporting audio files as a link tag in an image; I'm not sure if the support is there. I think maybe I'll write an internet draft proposing them.
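
The pasting idea can be tried out with Python's standard `wave` module, using WAV rather than MP3. A sketch only: the "recordings" are stand-in tones generated on the spot (`placeholder_clip` is a made-up helper), where a real version would load recorded spoken digits in the same sample format.

```python
import io
import math
import struct
import wave

RATE = 8000  # samples per second; clips are mono, 16-bit


def placeholder_clip(digit, seconds=0.3):
    """Stand-in for a recorded spoken digit: a short tone, one pitch per digit."""
    freq = 300 + 60 * int(digit)
    n = int(RATE * seconds)
    samples = (
        int(12000 * math.sin(2 * math.pi * freq * i / RATE)) for i in range(n)
    )
    # Little-endian signed 16-bit samples, as WAV expects.
    return b"".join(struct.pack("<h", s) for s in samples)


def audio_challenge(digits, clips):
    """Paste the raw clip for each digit end to end inside one WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(RATE)
        for d in digits:
            out.writeframes(clips[d])
    return buf.getvalue()


clips = {d: placeholder_clip(d) for d in "0123456789"}
wav_bytes = audio_challenge("4071", clips)  # bytes of one playable WAV file
```

Note that the raw samples are pasted together under a single WAV header; gluing complete WAV files end to end would not give a valid file.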

concern with captchas

LOL, that's not a fix. For example, if I was to download the image to the server, convert it to an OCR-readable image ... then sign the blog, would that not overcome that protection method ...

cough : jocr.sourceforge.net

DaveN

>captcha Agree w/ Daven, I

>captcha

Agree w/ DaveN, I've seen that cracked already, though not in blogs.

We know it's been cracked but...

Together with some of the other methods I mentioned, it would raise the bar significantly.

The point being that at the moment, building a simple blog spambot is easy for anyone with a reasonable programming background - I don't count myself a great programmer, but the one I did was quite sophisticated...

If you raise that bar, the problem would be less of an epidemic and more of a nuisance...

Access

Hi jystewart, welcome to Threadwatch! Please introduce yourself.

I agree in principle with your concerns on accessibility. As a visually impaired user I can tell you from first-hand experience that those things are a bloody nightmare - so your point is well taken.

It is still, however, one of the few (if not the only) really viable methods for halting (or vastly decreasing) blog spam.

Do you have any way around the access issues?

Access

Thanks. I've posted an introduction.

The simplest ways around the access issue are likely to just make the spammers' lives easier. It's no good using alt text or titles for these elements! One possibility would be to offer audio or visual options as that would cover the vast majority of users, but producing the audio files would likely eat up a lot of processor time (the load would be worse than the issues we're all currently facing from CGI usage, I guess) and in the end audio could be cracked just as images can be.

I suspect that in the end any solution is going to reduce simplicity and accessibility for sites and you're right that captchas could be offered as an alternative to creating an account. If both options are there then at least people _can_ use the site, even if it's awkward. My main reason for posting as I did is that the idea of captchas has been thrown around a lot of late (particularly on SixApart's ProNet list) without pause to consider the accessibility issues.

DaveN thats an OCR program, t

DaveN, that's an OCR program; the images created are designed to be recognisable only by humans. It doesn't even have to be words, e.g. "Who is this?" with a photo of some famous person, or "How many cats are in this picture?" etc. I don't think bots can pass this test: http://www.captcha.net/cgi-bin/esp-text

As for accessibility issues, alt text of "if you cannot read this image email webmaster@ to post your comment" on the image is a workaround, not ideal, but it still provides access.

hahaha

when the picture says...

"Sur la Table a Cook's Paradise Free shipping on selected items"

I agree that would be hard to filter but then again how many blog owners will make people jump through those hoops...

Also, I ran it by my script and got a 75% pass rate... which was better than I thought. Also, when I tested it by hand I got a 15% fail rate even when the word was in the picture !!!

DaveN
