Sunday morning I was sitting in my kitchen reading the newspaper and thinking about brunch, when Chris called me, saying he had a bunch of voicemails on his home phone in Iowa claiming something about our web app FaceStat being down. FaceStat is a site we made to show off our Dolores Labs crowdsourcing technology, and has had a small loyal following since we made it live about a month ago.
I checked the site and it gave me a 500 error — only 1 in 10 requests seemed to get me an actual page — so I logged into our app server and saw the disk was full. The log file had grown to 20 GB! I deleted it and asked my friend Zuzka to check and see if we’d been Slashdotted. She found a NY Daily News article that had just gone up talking about FaceStat, but we’d survived the traffic from a Wall Street Journal Buzzwatch post and it didn’t seem like that would be enough to make the log file blow up. I checked the file again and it was back up to 20 GB.
I wondered if we were under some kind of denial of service attack, so I called up Brendan to see if he could check it out. He found that our box was getting thousands of hits per second. Searching online, he found a whole bunch of Yahoo Answers questions asking why the site was down.
I’ve been hit by Digg and Slashdot before, but this spike in traffic was like nothing I’d ever seen. Then Zuzka figured out what was going on:
That’s right, we were on the front page of Yahoo.com, the most trafficked site on the web!
I looked in my inbox and it was full of thousands of angry emails like this:
Subject: facestat site would be interesting….
….if it worked. unfortunately, site crashes like this in conjunction with a press release don’t do much for credibility. after all, if site design and construction are poor, why would anyone think that the underlying concept and software have any validity?
Subject: quite the blunder!
Wow, the day you have your site profiled on Yahoo is the same day your site is down. What a stupid blunder!
It turned out that the reason Chris got a phone call was someone had looked up his phone number and posted it in the newspaper article’s forum:
Want to contact owner of the site?!? Chris Van Pelt in Spencer, Iowa is the registered owner and his phone number is: (712)262-8863. The IP address for the site is: 220.127.116.11 and is in St. Louis MO. The ISP is Slicehost LLC out of St. Louis, MO. which explains the IP. His email address is email@example.com Ahhh… the internet a wonderful thing when you decide not to hide personal information. Check out www.ip-adress.com This is all public information no gathering of information has been gathered in any illeagal manner.
I think they were meaning to be obnoxious, but the early warning was a huge help. I had been working like crazy and had resolved to take Sunday off; Brendan was about to go out with his friends, and Chris was on vacation. We took the site down and put up a static page:
Sorry! Turns out Yahoo put us on their front page without giving us any warning… (We appreciate the traffic, but we weren’t ready for a 1000x increase in load.) We’re working on getting the site back up in the next couple hours. Send an email to firstname.lastname@example.org, and we’ll send you an invite when we’re back up, or check back in once we’re off the front page.
Instantly emails started pouring in to email@example.com asking to be notified when they could use our site.
After working so hard to get users to come to your site, it’s amazingly frustrating to see hundreds of thousands of people suddenly locked out. Unbelievably, our webserver (nginx) couldn’t even reliably show that static page… Brendan discovered that we were exceeding the system’s open file limit — set at 100,000 — because connections were counting as open files.
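For anyone curious, the kind of change involved looks roughly like this. The numbers below are illustrative, not the values we actually used: nginx's `worker_rlimit_nofile` directive raises the per-worker file descriptor cap, and the OS-level limit has to be raised to match.

```
# /etc/security/limits.conf — OS-level per-process cap (values illustrative)
www-data  soft  nofile  200000
www-data  hard  nofile  200000

# nginx.conf — raise nginx's own cap and the connections per worker
worker_rlimit_nofile 200000;
events {
    worker_connections 50000;
}
```

Each open connection consumes a file descriptor, so under a flood of slow clients the descriptor count can blow past a limit that looks enormous on paper.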
There was an awesome article in Thaindian news that commented:
Facestat.com is a service where anyone can upload their picture and get it judged by the public.
Weather the website has been hacked or it is just some server issues seems to be unknown, we will post any updates as we try to gather more information.
We went to our hosting company Slicehost and started buying up more machines and bumping up the RAM on the database server. Before we were hit we had one app server and one database server, and no automated way of setting up machines. In the previous few weeks, we had been adding requested features as quickly as possible, not worrying at all about performance. While Brendan worked on setting up boxes I started ripping out every database intensive feature of our system and Chris added more caching… Around 1 AM we were back online and looking pretty stable. We thought about moving our database server to a bigger box, but the email system was really unstable: we were dropping invites that our users were sending, so that seemed like a bigger priority to fix. I was pretty sure that the load would be way down since we were off of Yahoo’s front page, so we eventually got to bed by 5 AM…
The next day I woke up and found we were being hit with a load similar to Sunday’s. I’m still not sure why so many people are still coming or how they’re finding the site, but because the site was actually functioning, we were successfully serving a much higher percentage of page views. Google Analytics tells me almost all the traffic comes from people without a referral URL.
The latency was still making the site almost unusable, at least from my perspective. We rely on the fact that people who upload a photo stick around to judge other ones, but with the high latency it seemed unlikely that they would do that. I was worried we’d be stuck with tens of thousands of photos uploaded and no way to get them judged. We moved the database to a new machine and added memcached. Brendan hacked together some amazing tools to monitor our boxes, which were a random hodge-podge of whatever size slices Slicehost could give us.
So now it’s Tuesday night and the site seems to be cranking along under 50x the load that used to work on one box. We have 6 app servers and a big database machine. I’m really impressed by what awesome hackers Chris and Brendan are and by what amazing tools are available these days. Slicehost has scaled up as fast as we’ve needed them to. Amazon’s S3 serves all the images, and while the latency isn’t great, we never could have dealt with the bandwidth issues on our own. Capistrano lets us deploy and roll back everywhere; git with GitHub lets us all hack frantically on the same codebase then merge and deploy. god (the process monitor) keeps all the servers running, and memcached has given us great caching with very little pain (mostly… ). Brendan would also give a shoutout to iTerm and its crazy multitab input mode, but he can write his own blog post about that… It’s one thing to code scalably and grow slowly under increasing load, but it’s been a blast to crazily rearchitect a live site like FaceStat in a day or two.
I figure at this point we’ve been on the number 1 (or 2) page on the internet, so there’s no bigger instant spike in traffic that could happen to us…
Some lessons I learned for the next time this happens:
(1) Monitor the site better. We had exception handling emailing us, but there were so many exceptions that I didn’t really look at them, and I wasn’t online. It wouldn’t have made sense to scale our site to handle this kind of load in advance, but it’s unfortunate we had to rely on random people deciding to look up Chris’s phone number and call his house to yell at him…
(2) Don’t be afraid to put up an error page. We had lots of excited users emailing us when we had a page up saying we were down and explaining why. We had lots of angry users emailing us when the site was up but lagging intolerably or crashing intermittently. I think wishful thinking caused us to put up the site an hour or two before it was ready.
(3) A statically generated homepage is a very good thing and memcached is awesome.
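The idea behind the static homepage is simple: render the page once in a background job and let nginx serve the resulting file, so the app servers and database never see homepage traffic at all. A rough sketch of the pattern, in Python; `render_homepage`, `publish_homepage`, and the photo-list argument are hypothetical illustrations, not our actual templates:

```python
from pathlib import Path

def render_homepage(recent_photos):
    """Build the homepage HTML once, instead of on every request."""
    items = "\n".join(f"<li>photo {p}</li>" for p in recent_photos)
    return f"<html><body><h1>FaceStat</h1><ul>{items}</ul></body></html>"

def publish_homepage(recent_photos, out_path="index.html"):
    # Write the rendered page to disk; nginx serves this file directly,
    # so homepage requests never touch the app server or database.
    html = render_homepage(recent_photos)
    Path(out_path).write_text(html)
    return html

# A cron job or background worker would call publish_homepage()
# every minute or so to keep the page reasonably fresh.
```

Serving a file from disk (or a page out of memcached) is orders of magnitude cheaper than rendering per request, which is exactly what you want on the one page every new visitor hits first.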