Archive for October, 2007

Christopher Hitchens

Wednesday, October 24th, 2007

As a liberal with an appreciation for thoughtful contrarian arguments I feel like I should appreciate Christopher Hitchens, but his unending support for the Iraq war really drives me crazy.

Hitchens recently wrote an article in Vanity Fair where he discovers that his writing inspired a smart idealistic kid from a rich liberal family to go off to Iraq looking for honor and sacrifice. The kid ends up getting himself blown up by a land mine.

I can definitely see myself being that kid in another life, and if I was that kid’s parents I would not be so nice to Christopher Hitchens.

A Stable Statistical Constant for Human Language Texts

Thursday, October 18th, 2007

Zipf’s law is a neat empirical result that says in all languages the frequency of the nth most common word is roughly proportional to 1/n, or more generally n^s where s is constant. It seems very cool that all different languages have this property until you realize that under many common distributions the nth most frequent item will have a frequency of roughly 1/n.

A coworker pointed me to this very interesting paper, A Stable Statistical Constant for Human Language Texts to appear in RANLP-07. It shows that many different languages, including English, German, Punjabi, Urdu, etc. all converge on roughly the same proportion of repetition of characters. On the other hand, non-human corpora such as the linux kernel or machine generated bigram models do not converge on this number. Even Chinese when written phonetically in pinyin converges on the same V=~0.5 (written in standard pictographic characters it doesn’t).

Do all writing systems converge on a somehow optimal value of repetition? That would be pretty cool.

Cracking Go

Thursday, October 11th, 2007

IEEE article about one of the members of the Deep Blue team trying to make a championship level Go program.

I believe that a world-champion-level Go machine can be built within 10 years, based on the same method of intensive analysis—brute force, basically—that Deep Blue employed for chess. I’ve got more than a small personal stake in this quest. At my lab at Microsoft Research Asia, in Beijing, I am organizing a graduate student project to design the hardware and software elements that will test the ideas outlined here. If they prove out, then the way will be clear for a full-scale project to dethrone the best human players.

The article left me very skeptical. Go has to main problems: the branching factor is high and static evaluation is hard. Most people coming from chess, including Feng-hsiung Hsu seem to focus on the branching factor problem. But without a very fast function that maps from positions to a somewhat reasonable guess at who is ahead, no Go program will ever work well.

It is really hard to tell who is ahead in Go. Even figuring out which group is alive or dead can require searching forward thirty or forty ply and often groups are half alive or half dead. But just alive or dead won’t give you a good estimate of the final scoring. The caching strategy the author talks about at the end of the article only solves a small piece of the problem.

It will be interesting to see what they come up with.

How can people play online backgammon for money?

Wednesday, October 10th, 2007

There are a large number of sites on the web offering the chance to play backgammon online for real money. The stakes are anywhere from 1$ to 100$, and there seem to be a large number of people playing on them.

My question is: given that the strongest computers are far stronger than the strongest humans how can these sites possibly survive? Why would anyone bet on them when someone can easily build a bot better than any human using the open source gnubg? Some of the sites have various responses but it seems like it would be pretty easy to inject noise or human moves in a way that would make it undetectable.

My coworker and backgammon partner Manny decided to join one of these sites mostly looking for strong backgammon competition. His justification was that chess sites that don’t filter for bots such as Yahoo are full of bots, while chess sites that do have no bots. (He’s a chess master who has spent a lot of time playing online chess and is sure that he can recognize bots He signed up for bgroom.com decided to invest 100$ as an experiment, but quickly lost it all. It wasn’t clear if he was playing a bot or not, so it’s still a mystery whether or not these sites are really just backgammon bots playing each other.

Personally, I would pay a small amount of money once in a while for a serious game of backgammon with a strong opponent. But I would feel cheated if I was playing a bot. I can’t quite explain why that is though.

Doomsday Argument

Monday, October 8th, 2007

If you order all the births of humans and you think that your birth number was chosen randomly, it is unlikely that you would be in the first 0.1% of humans. You can use this to infer confidence bounds on the number of humans that will ever live. But because population has been increasing exponentially, any such argument will have the last human being born in a relatively short amount of time.

The Wikipedia page puts it well:

If we take that 60 billion humans have been born so far (Leslie’s figure) then we can say with 95% confidence that the total number of humans, N, will be less than 20·60 = 1200 billion.

Assuming that the world population stabilizes at 10 billion and a life expectancy of 80 years, one can calculate how long it will take for the remaining 1140 billion humans to be born. The argument predicts, with 95% “confidence”, that humanity will disappear within 9120 years.

Stemming

Thursday, October 4th, 2007

Zawodny linked to this post making fun of MSN Search for bragging about stemming. The article says:

Understanding “driving” vs. “drive”? That’s a pretty basic problem called stemming. How basic? Put it this way, there’s been open source code out there to this before there was a Microsoft Search. Fuck, even the Wikipedia page has existed since 2007. All they had to do was go there. I know Microsoft is big about “eating their own dogfood”, but damn, use Google just this once to find it.

Now et’s assume for a second, they’ve had stemming and it’s a problem that’s actually harder — something like knowing “car” and “automobile” are the same thing. Now, granted that’s harder, but here’s the thing — when I transfered down to Yahoo Search Marketing, some 4-5 years ago, they already had this technology. The process was called canonicalization, and by the time I got there, it was an old, well-established piece of technology. So well-established that despite being fairly technical, everyone in the company — business, product, support, etc. — knew it was, at least at a high level, its purpose and so on.

This is ironic for many reasons. First of all, if I worked for Yahoo Search Marketing a few years ago, I would not be bragging about its relevance algorithms. Second, when did Yahoo Search start doing non-trivial stemming? Trust me, it wasn’t that long ago. And it’s not because people didn’t know what stemming was.

I find myself making two points about stemming and relevance over and over and I want to make it here once and for all:

(1) Just because someone is stemming/canonicalizing/spell-correcting in their presentation of search results does not mean they are stemming in their retrieval algorithm.

People always try to check “does Google do stemming?” by searching for “cars” and checking if a result has the word “car” highlighted. Many of my coworkers at Powerset and Yahoo who actually work on search engines have made this mistake.

Think about it like this: you are searching over 10 billion documents and often matching and ranking millions of them. Then you are highlighting 10 documents. Don’t you think that you might have different algorithms for displaying the results?!

(2) Stemming keyword search is really dangerous. Getting an overall relevance improvement is really hard.

One reasonable-sounding engineering strategy is to index the stemmed form of every word you see on the web and then search for the stemmed form of the query. This would be a complete disaster.

If you get a search for [cars] and you change the search to [cars OR car] it’s no big deal, but what about when someone searches [aids] and all they can get back is documents about [aid]? How do they undo it? You can always change your search for cars to [car] with stemming off, but how can you search for just [aids] with stemming on? There is a + operator that works for the major search engines but how many users know about it?

People who don’t work on relevance for web scale search engines often don’t realize just how amazingly useful anchor text is. Anchor text for a given page is the text in the html links from other pages linking to it. For a popular page (and these are the most important pages for a web search engine to return for keyword queries) it’s like tens or hundreds of different people sat down and annotated the page with titles. The best ten pages about cars most definitely have hundreds of links with the words car, cars, automobile, autmobile, etc. Stemming and canonicalization will never help you on a one or two word web query.

It will help you on an ad search query which is why Yahoo Search Marketing had stemming long before Yahoo Web Search. I am sure that Google/Yahoo/MSN all continue to have far more stemming on the query that gets sent to the advertising servers than the web servers.

Sacco’s letter

Tuesday, October 2nd, 2007

From Sacco’s last letter to his son Dante before he was wrongfully executed by the state of Massachusetts:

Well, my dear boy, after your mother had talked to me so much and I had dreamed of you day and night, how joyful it was to see you at last. To have talked with you like we used to in the days–in those days. Much I told you on that vist and more I wanted to say, but I saw that you will remain the same affectionate boy, faithful to your mother who loves you so much, and I did not want to hurt your sensibilities any longer, because I am sure that you will continue to be the same boy and remember what I have told you. I knew that and what here I am going to tell you will touch your sensibilities, but don’t cry Dante, because many tears have been wasted, as your mother’s have been wasted for seven years, and never did any good. So, Son, instead of crying, be strong, so as to be able to comfort your mother, and when you want to distract your mother from the discouraging soulness, I will tell you what I used to do. To take her for a long walk in the quiet country, gathering wild flowers her and there, resting under the shade of trees, between the harmony of the vivid stream and the gentle tranquility of the mothernature, and I am sure that she will enjoy this very much, as you surely would be happy for it. But remember always, Dante, in the play of happiness, don’t you use all for yourself only, but down yourself just one step, at your side and help the weak ones that cry for help, help the prosecuted and the victim, because that are your better friends; they are the comrades that fight and fall as your father and Bartolo fought and fell yesterday for the conquest of the joy of freedom for all and the poor workers. In this struggle of life you will find more love and you will be loved.

I can’t remember exactly where I read that paragraph for the first time — I think it must have been in high school — but somehow it implanted itself in my brain so much that I was recently able to find it by keyword searching through
an archive of letters at the University of Missouri’s website
.