Google and Latent Semantic Indexing

11 comments
Thread Title:
Google Latent Semantic Indexing Technology
Thread Description:

Aaron Wall takes a look at Google's Latent Semantic Indexing technology in light of recent shake-ups in the main index. He says that Google has increased the weight of the LSI part of the overall algorithm, and that does seem likely, as many around the web concur.

A brief definition:

Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.
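
To make the "many words in common" idea concrete, here is a minimal sketch in Python. It only covers the shared-word comparison the definition describes, using cosine similarity over raw word counts. Full LSI goes further and reduces the term-document matrix with a singular value decomposition, which is left out here. The document names and texts are made up for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical three-document collection (not taken from any real index).
docs = {
    "seo_a": "search engine ranking and link building",
    "seo_b": "link building for search engine ranking",
    "cooking": "slow roast the lamb with garlic",
}
vectors = {name: term_counts(text) for name, text in docs.items()}

# Documents sharing many words score high; documents sharing none score zero.
sim_close = cosine_similarity(vectors["seo_a"], vectors["seo_b"])
sim_far = cosine_similarity(vectors["seo_a"], vectors["cooking"])
```

The two SEO texts share five of their six words, so they come out close, while the cooking text shares none and comes out distant.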

I've also been told that Using Semantic Analysis to Classify Search Engine Spam, an older Stanford document, is well worth considering when looking at recent changes...

So, questions for the technically inclined as my eyes start to glass over with this stuff :-)

  • Is this what is going on at Google?
  • What's the best way of handling it?

Go ahead and thrash out your theories. It's all speculation, but it certainly seems likely, so getting to grips with it is a must...

Comments

If LSI is so important

Why does "miserable failure" still show Bush?

tilde

I notice from your trackback, Chris, that you posted what I was desperately trying to remember this morning when writing this :)

Use ~keyword to find words related to the subject...

Even better

Do searches using tilde, scrape the serps and pull out the words in bold, de-dupe to make lists ;O)
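
A rough sketch of the "pull out the bold words and de-dupe" step, in Python. It assumes you already have the results page saved as HTML and that the highlighted terms sit inside b, strong, or em tags; the real markup may differ and fetching the page is not covered. The class name, function name, and sample markup are made up for illustration.

```python
from html.parser import HTMLParser

HIGHLIGHT_TAGS = ("b", "strong", "em")  # assumed highlight markup


class BoldTermCollector(HTMLParser):
    """Collects the text found inside highlight tags."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self._buffer = []
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag in HIGHLIGHT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in HIGHLIGHT_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Normalise whitespace and case so duplicates collapse.
                term = " ".join("".join(self._buffer).split()).lower()
                if term:
                    self.terms.append(term)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def related_terms(html):
    """Return highlighted terms, de-duplicated, in first-seen order."""
    collector = BoldTermCollector()
    collector.feed(html)
    collector.close()
    return list(dict.fromkeys(collector.terms))


# Hypothetical snippet standing in for a saved results page.
sample = "<p><b>Guide</b> to <b>how to</b> write a <b>guide</b> and <b>tutorial</b></p>"
terms = related_terms(sample)
```

Using dict.fromkeys keeps the first-seen order while dropping repeats, which makes the resulting list easier to eyeball than a plain set.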

Related Words

A post at WMW shows evidence that related words are now being highlighted in the main index without using the ~ command.

Search for '[subject matter] tutorial' and notice that related terms such as 'how to' also show as bold in the search results.

It might be also be worth loo

It might also be worth looking at some old tools.

As an example, there is an analysis of this page, and you might also want to check out related words to..test

more than LSI

There were definitely some other things put in place, but there do seem to be some indicators that LSI was tuned up.

Mis Fail

Chris,

I doubt LSI/A is being used - it's not the most practical method for a search engine to make semantic connections, but it is something we should all think about. As for Bush and Miserable Failure - they probably are semantically connected by now. Let me do a quick C-Index:

OK, it's a little more complex than I thought because there's 'George W Bush' and 'George Bush', and the higher C-Index is with W - about an 8, similar to his C-Index with 'Richard Cheney'...
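
The thread doesn't define C-Index. One common reading is a co-occurrence index built from search result counts: the number of results containing both terms, divided by the number containing either, scaled to parts per thousand. The Python sketch below assumes that reading. The function name is made up and the counts are placeholders, not real search results.

```python
def c_index(n1, n2, n12, scale=1000):
    """Co-occurrence index under the assumed definition:
    scale * n12 / (n1 + n2 - n12), where n1 and n2 are result counts
    for each term alone and n12 is the count for both terms together."""
    if n12 > min(n1, n2):
        raise ValueError("joint count cannot exceed either single count")
    union = n1 + n2 - n12
    if union <= 0:
        raise ValueError("need at least one result for either term")
    return scale * n12 / union


# Placeholder counts for illustration only.
value = c_index(n1=1000, n2=500, n12=100)
```

With these placeholder counts, 100 of the 1400 results containing either term contain both, so the index is about 71 per thousand.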

I know Nick linked to this on

I know Nick linked to this one earlier today, but you should read down to post #50

As you can see, Google has quite a choice of methods, and I doubt that LSI would be the best one considering the task at hand. The Google algorithm is complex and uses many methods found in information retrieval, data mining, and A.I. It is very unlikely that one method as routine as LSI would be the main formula in this mathematical bundle. Also, this issue only addresses semantic similarity, not ranking, which I think is your priority.

Of course document similarity and topic detection are the main ways of returning relevant documents; however, there are so many ways to do this and none of them are straightforward. In fact, no one has yet found a stable way of applying methods that work very well in digital library collections to web data. The problem with data on the web is that it changes all the time. It's dynamic and unstable.

I keep a blog which deals with computing science methods where I explain these, and topics are based on things that I find in forums like this one, just to clear up any misunderstandings. I have no dealings with SEO, but visit SEO forums to assess how far professionals have come in using search and understanding its techniques. I work in A.I. and computational linguistics. Sorry about the long post.

http://forums.searchenginewatch.com/showthread.php?p=33069#post33069

The post is by a newcomer who seems to really have a handle on the subject. He also links out to his MSN blog

http://spaces.msn.com/members/search-science/

Which has links to some pretty advanced computing subjects, AI, and search engine books. It also has a link to a search engine designer and developer conference coming up this April. Sounds an awful lot like an insider to me.

Small Steps

Everyone seems to assume that LSI would be applied by using the entire corpus of docs in the index. What if they started with, say, what they deem to be most important? Like links?

Further, during their research process is it plausible that they might stumble on bits of LSI tech that can be used without creating an entire index ranked with LSI?

Pure conjecture on my part. The engines are hiring LSI folks for a reason and it's probably not for conversation. ;)
