Exploiting a Citation Based Algorithm

25 comments

Search engines such as Google have built a number of citation-based criteria into their algorithms over the past year, drawing on academic whitepapers such as Hilltop. Although citation engines are a good starting point for quality search results, it is apparent that the citation-based nature of Google, combined with the search functions and caching of large, prominent authority sites, can in fact be exploited.

Here is an example:

  • A series of pages is created on a domain, say www.mylittlewebsite.com, and the links on them point to a search request on one of these authority sites. Example: copy the link URL and paste it to see what I mean, it's loooong.
  • Notice the formatting: the link is written in HEX code. When it is wrapped in a standard HREF tag, the request made to the authority website's search is translated properly, and the result comes back as basic HTML. This is a clever coding exploit; the format ensures the request ends up as properly formatted basic HTML.
  • Obviously the request produces a negative search result on the authority website; however, site searches in particular will cache all results of local searches, successful or otherwise.
  • If these search results are spiderable content, then a robot such as Googlebot will view the cached results and see inbound links from a high-profile authority site pointing to the domain in question.
  • The result: www.mylittlewebsite.com jumps in the rankings.
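
For anyone running a site search who wants to see whether this is being tried on them, here is a minimal sketch of a check for the pattern described above: percent-decode ("un-HEX") the incoming query and look for anything shaped like an HTML tag. The function name and the tag pattern are my own illustrative assumptions, not something from the sites discussed, and a single decode pass will not catch double-encoded input.

```python
import re
from urllib.parse import unquote

# Illustrative pattern (an assumption): anything shaped like an opening or closing HTML tag.
TAG_PATTERN = re.compile(r"<\s*/?\s*[a-zA-Z][^>]*>")


def looks_like_injected_markup(raw_query: str) -> bool:
    """Return True if the percent-decoded query contains something shaped like an HTML tag."""
    decoded = unquote(raw_query)  # "%3C" becomes "<", "%3E" becomes ">", and so on
    return bool(TAG_PATTERN.search(decoded))


# An ordinary search term versus a HEX-encoded anchor tag.
print(looks_like_injected_markup("cat%20blog"))
print(looks_like_injected_markup(
    "%3Ca%20href%3D%22http%3A%2F%2Fwww.mylittlewebsite.com%2F%22%3Ecats%3C%2Fa%3E"
))
```

A check like this is only a tripwire for logging or alerting; it does not replace escaping the output.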

Now how long will it take Google to patch the hole in their algorithm? Not long I would guess.

Comments

gotta love it when...

we just hand the search engines a list of their holes on a silver platter...

tsk tsk

Now oilman, you're not implying that you would in fact use this hole would you?

*gasp* *shock* *horror*

Why that would be blackhat SEO...

;o)

One does have to wonder why

One does have to wonder why you just didn't keep this quiet and make a stack of cash....

Does one have to wonder?

I make stacks of cash in other ways :)

Also, call it professional transparency, plus the fact that my background in CS and Natural Language AI at times overrides my desire to exploit a hole like this. I don't think it's an easy fix, and I am not referring to delistings.

Now testing 1-2-3... pay attention and let's see how long it takes before this is patched.

Tick Tock....

all in the name of gauging Google's agility....

black hat

whether or not I'd use it is not my point....

I'd just like to see Google catch this stuff on their own once in a while ;)

I've been playing around with it with a few buddies - I'd love to see the look on the face of the stats guy at when he looks at the day's queries - heheheh

JoeAnt

JoeAnt had this problem until last week.

Of course I didn't see the money-making opportunity in it so I told them.

Stupid me.

The problem is ...

... This would only work if the target authority website was coded so badly as to allow such trivial cross-site scripting, and, as you say, if "the search results [were] spiderable." Both are unlikely on authority sites, which usually have higher than usual security given their high-profile nature. More of a hypothetical loophole than a real one.

More of a hypothetical loop hole, than a real one

you're not looking hard enough. We've found a ton of spidered results pages on many authority sites so far. The key is that the results page has to echo back the search in the URL.
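
On the "spidered results pages" part: one way a site owner could check whether their own results pages are open to crawlers is to test the robots.txt rules against a results URL. This is only a sketch using Python's standard robots.txt parser; the robots.txt content, the /search path, and www.example.com are all assumptions for illustration, not taken from any site mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for a hypothetical site whose search results live under /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the assumed robots.txt rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(is_crawlable("http://www.example.com/search?q=cats"))  # results page
print(is_crawlable("http://www.example.com/about"))          # ordinary page
```

With a rule like this in place, well-behaved robots should skip the results pages, so whatever gets echoed there is never picked up as a link.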

oh yeah....

and we've also found these authority sites SERPS ranking for some really spammy phrases themselves. Nifty ;)

Why did they just do anchor text

Why just anchor text ?
Why not a javascript redirect too ?

ahh come on guys....this

ahh come on guys....

this isn't anything new, we even discussed it a couple of weeks ago (though admittedly the context was slightly different).

Bad code has been about for years and will almost always continue to be written. To be fair to the engines, it's not so much a bad algo (the algos may be bad, but that is for another discussion another time :) ) as a crappily coded application running on the target web site.

Personally I wish this hadn't been posted, as it is highly likely I'll lose a tool in my arsenal, but that's the name of the game, and development and investigation will carry on trying to find those "new toys and techniques".

Quote:
Why just anchor text ?
Why not a javascript redirect too ?

A JS redirect will get the site you're playing with a nice email from Matt and his team, along with a 30 day ban. I reckon that the damage in XSS is the data that can be gleaned: everything from cookies to (using AJAX style coding) rewriting the DOM with your own data!

a variant

a variant of referer spam a.k.a. trackback spam, no? If the website publishes the user submitted content without sanitizing, they get code injections. Spammers inject hrefs, hackers inject shell scripts. That's what nofollow was intended to address ("injecting" backlinks into comments, right?)

I agree with Todd about the highlighting of it here. Seems odd. Every time I write about security/privacy I end up dumping many drafts because it just doesn't make sense to give it away. 1% of those who would benefit from the knowledge will actually follow the advice, while 95% of those looking to plug such loopholes will plug them on first awareness. With SEM it's even worse... more actors means more competition up until it gets plugged (and it will get plugged faster).

No andrews

This is a totally different thing than trackback/comment spam.

nah. Not totally different.

Sure, I might be missing the point, but I can execute this "exploit" as described and see it as not very different from hunting down sites that publish their referer logs, or sites that are so proud of their SE referrals they display them as dynamic content on their pages. Never publish unsanitized user data (that includes GET strings).
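
To make "never publish unsanitized user data" concrete, here is a minimal sketch of escaping a search term before echoing it back on a results page. The function name and the heading markup are hypothetical; the escaping itself is Python's standard html.escape.

```python
from html import escape


def render_search_heading(query: str) -> str:
    """Echo the user's search term back as inert text, never as markup."""
    # escape() converts &, <, > and (with quote=True) both quote characters to entities.
    safe_query = escape(query, quote=True)
    return f"<h1>Results for: {safe_query}</h1>"


# A submitted anchor tag comes back as visible text, not as a live link.
print(render_search_heading('<a href="http://www.mylittlewebsite.com/">cats</a>'))
```

Once the term is escaped, a spider reading the results page sees plain text rather than an anchor, so there is no link to count.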

...

The difference is that the comment/referer log spam you compare this to takes advantage of the way a system was designed to function. It's "greedy" behavior.

This however is not by design. The developers didn't see the flaw, or how the data would get put in the public domain. This is an exploit of a vulnerability.

OIC

So while one is "greedy", the other is "exploiting market inefficiencies". I understand, thanks ;-)

this isn't a bug in Google...

No more than comment spam or log spam is. It's just another exploit of insecure scripts. Has anyone tested this against BC's DSM script? That's designed to make sure search engines crawl site search pages, and it has an easy to identify footprint. It would be a shame if all those sites were unwittingly hosting links for Texas Hold 'Em Guy.

Agree

This is not a Google bug - definitely not, but like link spam in general it does artificially affect their ranking algorithms the same way other link spam does. Of course, XSS can be (ab)used for far more than just link spam, but that's another story :)

I don't know of any sites actually using BC's DSM scripts, but if you could sticky me a few URLs of sites using it, it's pretty easy to test. In case it's not safe, BC should be told :)

Is this right

I want to make sure I get this right: is this how I get a Cat Blog to rank using an authority site like Google?

I think...

This "hole" sounds familiar to me. I remember reading about it waay back in like.. 2001? The only difference is the hex code.

I'm not even sure the HEX is necessary.

> I'm not even sure the HEX

> I'm not even sure the HEX is necessary.

It all depends on what kind of application you are trying to attack. Sometimes you need HEX or other formats to actually get in. However, you are right that most systems are left wide open.

The Hex is useful

The Hex is useful for the inbound link you supply to the page, to get it spidered by SEs. It also hides the intent from the eyes of the casual observer.

The only problem with HEX is

The only problem with HEX is that it's so terribly long :) Often you run into character limits in the 40-50 range, and that's not a lot using HEX. If you mix HEX with other types you can make it shorter and still very hard to decode from just looking at it :)
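
The length point is simple arithmetic: every byte written in HEX takes three characters (a percent sign plus two hex digits), so a 40-50 character limit leaves room for only about 13-16 characters of actual text. A rough sketch on a harmless phrase; hex_encode_all is a hypothetical helper written for this example, while quote and unquote are the standard library functions.

```python
from urllib.parse import quote, unquote


def hex_encode_all(text: str) -> str:
    """Percent-encode every byte, including the ones that don't need it."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


phrase = "cat blog"
fully_encoded = hex_encode_all(phrase)  # every byte becomes three characters
minimally_encoded = quote(phrase)       # only the space needs encoding

print(len(phrase), len(fully_encoded), len(minimally_encoded))
print(unquote(fully_encoded) == phrase)  # both forms decode to the same text
```

This is also why a filter should always decode first and inspect afterwards: the encoded form says nothing about what the text really is.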

DSM...

Quote:
I don't know of any sites actually using BCs DSM scripts

Just a little joke, Mikkel. Last time I checked, between the top 4 search engines, a whopping 8 sites had DSM pages indexed. Google had them as supplemental results.

**subscribing**

**subscribing**
