Google is releasing a trillion keywords

15 comments

"....we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages." - Google Research

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Comments

Dave's top 10 list

Quote:
we're excited to hear what you will do with the data

And the #1 'thing we'll do with the data' is......automated content generation.

seed the list

When people sell a mailing list it's extremely common for sellers to seed the list with some names that only exist for the purpose of catching people who are misusing it. I would have to assume the boys and girls at the plex would do the same.

Easier than it looks

They just downloaded the terms eBay is running in Adwords.

hehe

That will be 211761201 adwords accounts then.

How about if I create one subdomain for each keyword?

I'm sure Google will index all my new content:.)

graywolf, you have a

graywolf, you have a devious, devious mind. How many other people would consider seeding the terms with some nonsense phrases? I ask you--how many other people would come up with an idea like that?

Well, I guess I can think of a couple people..

Release Yes. Cost?

If you happen to belong to LDC, for the small sum of 20k per year, you can pick up all that data. No other pricing structure is mentioned...

Damn, look at the brains on

Damn, look at the brains on graywolf! Nice catch.

Put me down for the 13,653,070 unique words anyway.

I'll see if I can run some kind of topical sort.

Matt,

Two questions

1. Are those across multiple languages?

and

2. Has Google (or any company) released access to statistical analysis software similar to what's used in G's language translation?

-q

trust but verify

>graywolf, you have a devious, devious mind.

Should I take that as a compliment?

Seriously, if you work in direct marketing and you buy mailing lists, either for email or snail mail, it's incredibly common for them to populate the list with "dummy names" that forward back to the list seller. Generally speaking, when you rent a mailing list it's a one-shot deal, so if the company gets more than one mailing from you they know you used the list more than you were supposed to, and you get hit with a legal action or monetary fine. If you use any third party to send out email or snail mail catalog packages for you and you aren't seeding the list, how do you know they aren't "borrowing" your data? Pretty similar to the "trust but verify" policy lots of folks have. As I see it, it's just smart business practice, not being devious at all.
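For anyone who wants to see the seeding idea spelled out, here's a rough sketch in Python. Everything in it is made up for illustration: the function names `seed_list` and `overused_seeds`, the example.com addresses, and the assumption that the seller logs each mailing that reaches a seed address as an (address, mailing id) pair. It is not how any particular list broker does it, just the general shape of the check.

```python
from collections import Counter


def seed_list(addresses, seeds):
    """Return the rented list with the seller's tracking addresses mixed in."""
    return sorted(set(addresses) | set(seeds))


def overused_seeds(received, seeds, allowed_uses=1):
    """Return seed addresses that received more distinct mailings than allowed.

    `received` is an iterable of (address, mailing_id) pairs logged by the
    seller (a hypothetical log format). Duplicate deliveries of the same
    mailing are counted only once.
    """
    counts = Counter()
    seen = set()
    for address, mailing_id in received:
        if address in seeds and (address, mailing_id) not in seen:
            seen.add((address, mailing_id))
            counts[address] += 1
    return {addr: n for addr, n in counts.items() if n > allowed_uses}


# Hypothetical example: two seeds, one renter who mails twice.
seeds = {"seed1@example.com", "seed2@example.com"}
rented = seed_list(["a@example.com", "b@example.com"], seeds)
log = [
    ("seed1@example.com", "mailing-1"),
    ("seed1@example.com", "mailing-2"),
    ("seed2@example.com", "mailing-1"),
]
print(overused_seeds(log, seeds))
```

A seed turning up in the result only tells you it got more distinct mailings than the rental allowed; whether that proves misuse is a separate argument.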

catching people

Quote:
only exist for the purpose of catching people who are misusing it

Uh, like the word 'googletestad'? LOL

amateurs

re: seeding a mailing list

Only amateurs will get caught in such a "trap". You can sue anyone for anything, but you will have a hard time (and spend a penny or two) pursuing that action. Just because ten mailings were received at a tracking address (or 20?) doesn't mean the list was re-used ten times in violation of a contract. Quick: name 3 ways that *might* have happened.

The best defense is a good offense, right? Ever get hit with an aggressive counter suit? I mean an aggressive one? You may just end up settling to make the counter suit go away :-)

With regard to mailing

With regard to mailing lists, amateurs only get caught in such a trap because only an amateur would re-use a rented mailing list.

Again, with regard to mailing lists, it would be a pretty simple case before a jury to show that a list was improperly reused if a different mailing showed up on a different date to a few of the seeded addresses. That someone could propose 101 ways it might have happened accidentally does not mean that a bunch of silly arguments would fly before a judge and jury. The list owner would get their damages, and, if the list rental contract was written the way it ought to be, the infringer would pay all the costs and attorneys' fees of the winning list broker. While I was a big believer in aggressive countersuits when I was practicing law, I don't think there's a good one here.

I'm not sure if this translates to the list of terms released by Google, even assuming Google seeded the list. I doubt that a judge and jury would find seemingly random collections of words to be as probative as a name and an address. Your average juror understands names and addresses from everyday life, and knows how unlikely it is that an exact name and address would just magically appear out of thin air; I don't think they bring the same sort of everyday background to apparently random combinations of keywords. I don't know what terms of use are attached to Google's release of the terms; I'm also not sure that any terms of use can be applied to use of the keywords once they start circulating in more or less the public domain.

Of course, Google's response likely would not be to bring a lawsuit. They could sufficiently chill misuse just by cancelling the Adwords accounts or placing in supplemental listings all results from all the IP addresses owned by the folks using the keywords in ways they don't like. The infringer could sue, if they wanted, and if they could figure out what Google had done, but Google can afford the legal fees and, in the unlikely event they lost, the damages. My experience has been that multibillion dollar corporations don't mind the occasional multimillion dollar lawsuit if running the risk of such lawsuits is related to protecting important corporate interests or principles.

In other words, when it comes to this list of terms, it's the same old same old. If you are smart enough to use these terms in a productive spamming scheme, you are smart enough to know the kinds of things that Google might do to you and your sites if you get caught. My guess is that there are folks thinking right now about how to use them, and also prepared to take their punishment like a man if they get caught stepping over the line.

Trap street

graywolf, yes you should take it as a compliment. Not to worry, I'm familiar with the practice. My favorite is Lye Close, the fake street in London: http://wiki.openstreetmap.org/index.php/Copyright_Easter_Eggs

billhartzer, sshhh. I was just watching boogybonbon find out about "google monitor query or googletestad" today. Don't ruin the fun.

Google, a small thankyou for

Google, a small thankyou for releasing this. A large thankyou will depend on whether my membership allows me access and the licence behind it.

>> And the #1 'thing we'll

>> And the #1 'thing we'll do with the data' is......automated content generation

That would be my guess.... It'd need a bit of arm's-length deniability (publish'n'scrape, or similar), but I'd bet you could clean the data in a few weeks. Anyone with the right resources could do a decent job fairly quickly, I think.
