Congressional Power Chart

March 9th, 2008


The Monkey Cage shows an interesting chart of congressional power compiled by Knowlegis.

Boosting

March 4th, 2008

I don’t get much time to read papers these days, but this JMLR article called Evidence Contrary to the Statistical View of Boosting was fascinating (found on Inductio Ex Machina).

It uses a format where the authors write their thesis, a few people respond, and the authors write a short rejoinder. The whole exchange is a gold mine of practical tricks on how to get good performance with boosted decision trees.

One of the main questions the article discusses is whether weak learners should be restricted to the minimum power necessary to fit the data. For example, should an additive model be restricted to stumps as weak learners? The textbook answer is yes, but in practice having tree sizes larger than the number of interactions can improve performance. This discussion came up more than once in previous jobs.

Friedman makes a great point that boosted decision trees are optimizing some loss function on the probability of output, but the authors’ claims are all based around classification accuracy.

reCAPTCHA

February 27th, 2008

Last week I saw Luis von Ahn give a talk about reCAPTCHA, his latest project to stop spammers and do OCR on scanned books at the same time. Normally, CAPTCHAs are used by sites like Ticketmaster and Yahoo and show you an image of a random collection of characters that are distorted in a way that humans can read them but computer programs can’t.

reCAPTCHA takes words from scanned old books where the OCR program had low confidence in its transcription. It shows one of those words to a user, along with a different word that the OCR transcribed correctly.

If the user enters the known word correctly, they are assumed to be human. The user’s transcription of the unknown word is then used as a gold standard transcription. I believe Luis said that if they require two people to transcribe a given word in this way, the accuracy is above 99 percent.
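Here is a minimal sketch of that check as I understood it from the talk. It is not reCAPTCHA's actual code: the class name, the method names, the case-insensitive matching, and the rule that a word is accepted once two passing users give the same answer are all my assumptions.

```python
from collections import Counter, defaultdict


class RecaptchaSketch:
    """Toy model of the known-word / unknown-word check (hypothetical names)."""

    def __init__(self, votes_needed=2):
        # votes_needed=2 mirrors the "two people" figure from the talk;
        # requiring the two answers to match is an assumption.
        self.votes_needed = votes_needed
        self.votes = defaultdict(Counter)  # unknown word id -> answer counts
        self.accepted = {}                 # unknown word id -> accepted text

    def submit(self, known_answer, known_truth, unknown_id, unknown_answer):
        """Return True if the user is treated as human."""
        if known_answer.strip().lower() != known_truth.strip().lower():
            # Failed the known word: reject and ignore the other answer.
            return False
        guess = unknown_answer.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= self.votes_needed:
            # Enough agreeing answers: keep the first accepted transcription.
            self.accepted.setdefault(unknown_id, guess)
        return True
```

The point of the design is that the unknown word never affects whether you pass, so a wrong guess costs the user nothing, while a failed known word throws the guess away.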

Apparently users on the internet do something like 60,000,000 CAPTCHAs a day, and transcribing a word costs around 0.5 cents, so this project is making lots of transcriptions of books that wouldn’t otherwise be possible. It helps people with bad eyesight and makes a great training corpus for OCR research.
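To get a feel for those numbers, here is a back-of-the-envelope calculation. It is an upper bound at best: it assumes every one of those CAPTCHAs is a reCAPTCHA (certainly not true) and that each word needs two users, and the function name is made up for illustration.

```python
def daily_transcription_value(captchas_per_day, cents_per_word, users_per_word):
    """Return (words transcribed per day, equivalent dollar value per day)."""
    # Each word consumes `users_per_word` solved CAPTCHAs (an assumption).
    words_per_day = captchas_per_day // users_per_word
    dollars_per_day = words_per_day * cents_per_word / 100
    return words_per_day, dollars_per_day


# Figures from the talk: 60,000,000 CAPTCHAs a day, 0.5 cents per word.
words, dollars = daily_transcription_value(60_000_000, 0.5, 2)
print(words, dollars)
```

Under those assumptions that is 30 million words and about $150,000 worth of transcription a day.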

The beauty of this project is that it will always be one step ahead of OCR programs. CAPTCHAs have had to get tougher and tougher over the years as OCR systems get better. But as long as Luis is using an OCR program that is near state of the art, if his program can’t figure out the correct transcription it’s an impossible task for other OCR programs.

I’ve set it up so that if you try to enter a comment on this blog, you can see reCAPTCHA at work. :)

Anti-Portfolio

February 26th, 2008

I love that these VCs (Bessemer Venture Partners) have an Anti Portfolio of good investments that they turned down.

Contrast that with Battery Ventures saying that passing on Facebook “may turn out to have been a mistake”.

Political Circular Mill

February 24th, 2008

Pop-science favorite The Wisdom of Crowds talks about how army ants using simple local heuristics will occasionally start following each other in a circle until all or most of them die of exhaustion. I was poking around on the internet about this phenomenon and found two articles:

Republican Ants March in “Circular Mill” of Death

Political Entomology, Part II: Liberal Ants and Their Circular Mill

These aren’t responses to each other; the authors seem to have come up with their observations completely independently.

Why is the NY Times IT department so good? (and the BBC so bad?)

February 21st, 2008

How did the NY Times get such awesome engineers and designers? They use cool new technologies like Hadoop/EC2/S3 to deploy archive search and push out beautiful and informative interactive graphs every week.

In contrast, the BBC deploys the infamous Perl on Rails because their infrastructure sucks.

If you just look at the front pages, it’s clear that the NY Times is using the web medium much better than the BBC.

I wonder how this happens. Is it a difference in budgets? A few critical decisions that go right or wrong? The personality of the person in charge?

Hollywood and Silicon Valley

February 20th, 2008

I have a random distant connection to Marshall Herskovitz, the creator of Quarterlife (and a lot of well known mainstream TV shows and movies). Recently I passed his contact info along to my friends who founded Episodic, an online video advertising startup. It sounds like he’s a really nice guy and the meeting went well.

He wrote an article about Silicon Valley in Slate today, and I can’t help but wonder if meeting my friends had some influence on it.

He says,

Geeks, engineers, and boys. And because the DNA of the Internet is entirely male, it exudes the best and worst of what males have to offer. On the plus side—it’s brilliant, complex, competitive, audacious in how it’s changed our way of organizing experience. On the negative side—it’s linear, utilitarian, cold, emotionless, disconnected.

I think it would be slightly more specific and accurate to say “the DNA of the Internet is entirely nerd”.

If the Slate article is Hollywood’s take on Silicon Valley, Marc Andreessen’s excellent post, Rebuilding Hollywood in Silicon Valley’s image, has to be Silicon Valley’s take on Hollywood.

iPaper

February 19th, 2008

My friends at Scribd just launched iPaper, a nice alternative to PDF. The demo looks very nice. More discussion on digg.

Yahoo using Hadoop

February 19th, 2008

Cool to see Yahoo has switched over its webmap search infrastructure to the open source Hadoop project.

From Jeremy Zawodny’s blog:

OLPC on Mechanical Turk

February 12th, 2008

I got one of the one laptop per child laptops a few months ago. It’s pretty cool.

I was really surprised to see a task on Amazon’s Mechanical Turk to review the laptop in the OLPC forums for five dollars. How many people with OLPCs are on Mechanical Turk? Me I guess… Made an easy five bucks.
