Spam on MSN Search...err..MSN on Spam

SEO by the Sea has a great post about spam analysis done at MSN. MSN's credibility on finding web spam might be a bit questionable (since they rank it so well), but the research ran through a variety of factors associated with web spam. Most were related to temporal page variance and running things like inlinks and outlinks through power laws. For TWers who hate to read research pages, Bill also mentioned this 36-minute video.

via Peter D


Very Interesting

But why are they publishing these finds in such a detailed document? Is this going to show auto content generators how they need to advance in their quality and randomness of content?

Could they be publishing it in an attempt to see who bites? Or do they have too much time on their hands?


@ RickStar:

The content of the papers is mostly 2+ years old ;-).


Some newer research there

Thanks, Aaron.

Most of the papers are at least 2 years old, and the patent/patent application process can be a long one. I wanted to try to provide a backdrop for that document, which is why those are so old.

But the last paper in the post, Detecting Spam Web Pages through Content Analysis (pdf), is much newer, and those other documents act as a good backdrop for it. A few of the citations at the end of this latest document are dated from August and September of last year, and it's being formally unveiled at WWW2006 towards the end of May.

But why are they publishing these finds in such a detailed document? Is this going to show auto content generators how they need to advance in their quality and randomness of content?

A very good question. Economics may presently be on the side of those who can automatically generate content and pages that skew search results. The last paper notes:

Effectively detecting web spam is essentially an "arms race" between search engines and site operators. It is almost certain that we will have to adapt our methods over time, to accommodate for new spam methods that the spammers use. It is our hope that our work will help the users enjoy a better search experience on the web.

Victory does not require perfection, just a rate of detection that alters the economic balance for a would-be spammer. It is our hope that continued research on this front can make effective spam more expensive than genuine content.

While sharing this information on a wide scale enables people who would spam search engines to see some results, I'm positive that a number of their methods and conclusions are left unstated. But if there's enough there to attract more people to study web spam and undertake the effort to fight it, that may be the main purpose for sharing this information.

I haven't yet looked at the other two papers that are going to be presented at WWW2006, but the titles are interesting.


funny analogy to "the family"

I am reminded of a movie about TheFamily™ where two-bit hoods in town were tolerated as long as they didn't steal from TheFamily™. In this movie, a coupla hoods hijacked a trailer full of smokes, and were later educated about how they might enjoy better long term health if they checked in with their local FamilyMember™ prior to executing such heists...just to be sure their efforts were "copacetic". The two-bit hoods comprised a farm system for the bigger leagues. Those who listened and cooperated got adopted and those who didn't...well...

So here we have TheBigG™ managing a wealth creation system that rewards webmasters (AdSense). The two-bit hoods (MFA spammers) conduct heists that hurt TheGFamily™, so the Don sends a messenger to the local pasta mill to "have a chat" wit dem. You can do scores, but yous just needs to check in wit me foist.

In the end TheBigG™ manages a distributed network of hoods who are rewarded for their cooperation (via AdSense) and terminated if they steal from the wrong Family. What do the other families have to do to counter this extreme competition? Pretty simple, actually. Either provide bigger rewards[1], find a way to undermine TheFamily™ system[2], agree to some profit-sharing deals[3], or start accumulating Tommy guns[4].

[1] How many webmasters would switch from AdSense to PlanB if the rewards were routinely 10% higher?
[2] Lobby, sue, get patents, hire away staff, publish research that might enable hoods to beat TheBigG™ SERPs...
[3] With so much future at stake, I doubt M$ will make deals with G
[4] I don't advocate violence, and it hasn't really worked in the past either.


related...