Google and Latent Semantic Indexing
Google Latent Semantic Indexing Technology
http://www.seobook.com/archives/000657.shtml
Aaron Wall takes a look at Google's Latent Semantic Indexing technology in light of recent shake-ups in the main index. He says that Google have increased the weight of the LSI part of the overall algorithm, and that does seem likely, as many around the web concur.
A brief definition:
Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.
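To make the definition above concrete, here is a minimal sketch of the textbook LSI recipe: build a term-document matrix, reduce it with a truncated SVD, and compare documents by cosine similarity in the reduced space. The sample documents, the function names and the choice of k=2 are all assumptions made for illustration; nothing here is claimed to reflect what Google actually runs.

```python
import numpy as np

def lsi_document_vectors(docs, k=2):
    """Return one k-dimensional vector per document via truncated SVD."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document matrix: rows are terms, columns are documents.
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    # Decompose, then keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical mini-collection: three documents about cars, three about pets.
docs = [
    "car engine wheel",
    "car wheel road",
    "engine road car",
    "cat dog pet",
    "dog pet food",
    "cat food pet",
]

vecs = lsi_document_vectors(docs, k=2)
same_topic = cosine(vecs[0], vecs[1])   # two car documents
cross_topic = cosine(vecs[0], vecs[3])  # a car document vs a pet document
print("same topic:", round(same_topic, 3))
print("cross topic:", round(cross_topic, 3))
```

Documents that share vocabulary end up close together in the reduced space, while documents with no words in common end up far apart, which is the "semantically close" versus "semantically distant" behaviour the definition describes.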
I've also been told that Using Semantic Analysis to Classify Search Engine Spam - an older Stanford document - is well worth considering when looking at recent changes...
So, questions for the technically inclined as my eyes start to glass over with this stuff :-)
- Is this what is going on at Google?
- What's the best way of handling it?
Go ahead and thrash out your theories. It's all speculation, but it certainly seems likely, so getting to grips with it is a must...


Some links
Courtesy of Marcia/WMW
Latent Semantic Indexing in Google update
If LSI is so important
Why does "miserable failure" still show Bush?
tilde
I notice from your trackback, Chris, you posted what I was desperately trying to remember this morning when writing this :)
Use ~keyword to find words related to the subject...
Even better
Do searches using tilde, scrape the SERPs and pull out the words in bold, de-dupe to make lists ;O)
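The "pull out the words in bold, de-dupe" step might look something like the sketch below. It assumes you already have a results page saved as HTML and that highlighted terms are wrapped in b or strong tags; the class name, function name and sample markup are hypothetical, and fetching the page is deliberately left out.

```python
from html.parser import HTMLParser

class BoldTermCollector(HTMLParser):
    """Collects the text found inside <b> or <strong> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many bold tags are currently open
        self._buffer = []    # text pieces seen inside the current bold run
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and store the finished bold phrase.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.terms.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

def related_terms(html):
    """Return lower-cased bold terms in first-seen order, de-duplicated."""
    collector = BoldTermCollector()
    collector.feed(html)
    collector.close()
    seen = set()
    unique = []
    for term in collector.terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# Hypothetical snippet standing in for a saved results page.
sample_html = (
    "<p><b>Car</b> repair and <b>auto</b> parts. "
    "Cheap <b>car</b> <b>automobile</b> hire.</p>"
)
terms = related_terms(sample_html)
print(terms)
```

Running this over the saved pages for several ~keyword searches and merging the lists would give a rough related-words list for a subject.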
Related Words
Post at WMW showing evidence that related words are now being highlighted in main index without using ~ command
Search for '[subject matter] tutorial' - notice related terms such as 'how to' are also showing as bold in search results
It might also be worth looking at some old tools.
As an example, here is an analysis of this page, and you might want to check out related words to... test
more than LSI
There were definitely some other things put in place, but there do seem to be some indicators that LSI was tuned up.
Mis Fail
Chris,
I doubt LSI/A is being used - it's not the most practical method for a search engine to make semantic connections, but it is something we should all think about. As for Bush and Miserable Failure - they probably are semantically connected by now. Let me do a quick C-Index:
OK, it's a little more complex than I thought because there's 'George W Bush' and 'George Bush', and the higher C-Index is with W - about an 8, similar to his C-Index with 'Richard Cheney'...
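The post doesn't say how the C-Index is calculated. As a rough sketch only, and assuming it is a Jaccard-style co-occurrence ratio over search result counts scaled to parts per thousand, the arithmetic could look like this. The function name, the scale factor and the counts are all hypothetical.

```python
def cooccurrence_index(n1, n2, n12, scale=1000.0):
    """Jaccard-style co-occurrence ratio (assumed formula, not from the post).

    n1  -- number of results for the first term alone
    n2  -- number of results for the second term alone
    n12 -- number of results containing both terms
    """
    union = n1 + n2 - n12
    if union <= 0:
        raise ValueError("result counts must describe a non-empty union")
    return scale * n12 / union

# Hypothetical result counts, purely to show the calculation.
example = cooccurrence_index(n1=100, n2=200, n12=50)
print(example)
```

Under that assumption, a higher value means the two terms turn up together in a larger share of the documents that mention either one.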
I know Nick linked to this one earlier today, but you should read down to post #50
http://forums.searchenginewatch.com/showthread.php?p=33069#post33069
The post is by a newcomer who seems to really have a handle on the subject. He also links out to his MSN blog:
http://spaces.msn.com/members/search-science/
Which has links to some pretty advanced computing subjects, AI, and search engine books. It also has a link to a search engine designer and developer conference coming up this April. Sounds an awful lot like an insider to me.
Small Steps
Everyone seems to assume that LSI would be applied by using the entire corpus of docs in the index. What if they started with, say, what they deem to be most important? Like links?
Further, is it plausible that during their research process they might stumble on bits of LSI tech that can be used without creating an entire index ranked with LSI?
Pure conjecture on my part. The engines are hiring LSI folks for a reason and it's probably not for conversation. ;)