Friday, April 25th, 2008...12:10 am
The Day the Music Died
By: Robert Baskin (Online Director)
Last Thursday around 8 a.m., I received a phone call from our Online Editor informing me that the Web site was down, and politely asking if I could get it back online as soon as possible. It seems the Drudge Report had linked to our story on Aliza Shvarts, and the deluge of hits had knocked us off the Internet. Over the course of Thursday and Friday, we were linked to (often simultaneously) by Drudge, Digg, Reddit, Gawker, Perez Hilton, Fox News, MSNBC and others. On Thursday, we received 12 times our average amount of traffic. In this post, I’ll go over some of the technical details of what happened over that two-day period, and some changes we’re implementing to make things better.
The first goal was to get some page online, instead of just appearing down. Not loading anything was the worst thing that could have happened. Fortunately, we have a check built in very early in our application to see if it can connect to MySQL. All we had to do was change the MySQL password to be incorrect, and the application started redirecting to our error page. That’s fairly simple for the server to handle: all it has to do is try to connect to MySQL, fail, and then redirect to a static HTML page. (We didn’t shut down MySQL entirely because it also serves some other sites besides yaledailynews.com, and we wanted those to continue working as much as possible.) But an error page definitely isn’t ideal, and if we didn’t get the site back up quickly, Drudge would drop its link. We would lose the traffic, and Drudge would be less likely to link to us again.
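For readers who want to see the shape of that early check, here is a minimal sketch in Python. It is only an illustration, not the site's actual code: the function names, the `/error.html` path and the `connect_db` callable are all assumptions made up for the example.

```python
from typing import Callable, Tuple

ERROR_PAGE = "/error.html"  # hypothetical path to the static error page


def handle_request(path: str, connect_db: Callable[[], object]) -> Tuple[int, str]:
    """Return (status, target) for an incoming request.

    connect_db is any callable that raises an exception when the database
    cannot be reached, for example because the password is wrong.
    """
    try:
        connect_db()
    except Exception:
        # Database unreachable: redirect the visitor to the static error page.
        return 302, ERROR_PAGE
    # Database reachable: carry on and serve the normal dynamic page.
    return 200, path


def broken_connect() -> object:
    """Simulates a connection attempt made with an incorrect password."""
    raise ConnectionError("access denied")
```

The point of failing this early is that the expensive work (queries, templating) never starts, so each request costs little more than one refused connection and a redirect.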
The next step was to get something showing up for Drudge users. We copied the text of the article and pasted it into a static HTML file. Then we had our application redirect all users coming from Drudge to that static HTML file with the popular story. Our server was able to handle that. But it was erroring for everybody else, which was no good. So we put a link on the error page to the static version of the story, which seemed like a good idea at the time. However, as our editors informed us, we didn’t want to seem like we were blowing the story up unnecessarily, so we removed the link and went back to the drawing board.
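The referrer-based redirect described above could look roughly like the sketch below. It is written in Python for illustration and assumes the check is done on the host name of the `Referer` header; the file paths and function names are hypothetical, and the real application may well have used a simpler substring match.

```python
from typing import Optional
from urllib.parse import urlparse

STATIC_STORY = "/static/story.html"  # hypothetical path to the pasted article
ERROR_PAGE = "/error.html"           # hypothetical path to the error page


def is_from_drudge(referrer: Optional[str]) -> bool:
    """True when the Referer header points at the Drudge Report."""
    if not referrer:
        return False
    host = urlparse(referrer).hostname or ""
    return host == "drudgereport.com" or host.endswith(".drudgereport.com")


def route(referrer: Optional[str]) -> str:
    """Drudge visitors get the static story; everyone else gets the error page."""
    return STATIC_STORY if is_from_drudge(referrer) else ERROR_PAGE
```

Keep in mind that the referrer is supplied by the browser and is sometimes missing, so a rule like this only catches visitors whose browsers send it.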
The story was continuing to explode. Our static page was holding up, but we really wanted to get all of our stories back online. We had implemented view caching (I will blog about this in the future), and the pages were being cached in memory in XCache. However, I noticed that we were hitting our memory limit as Apache processes were spawned. Our caching system can fall back to file-based caching if we tell it to, so I figured we could try that. I restricted access to the main site to my own IP address and clicked through a couple of pages to prime the caches. This is important: if we started our site up with an empty cache, the server would overload trying to fill the cache as people started visiting. Once the cache was primed, I opened things up.
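To make the file-based fallback and the priming step concrete, here is a small Python sketch. It is not our caching system: `FileCache`, `cached_render` and `prime` are hypothetical names, and `render` stands in for whatever function builds a page from the database.

```python
import hashlib
import os
from typing import Callable, Iterable, Optional


class FileCache:
    """Stores each rendered page as a file instead of holding it in memory."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key: str) -> str:
        # Hash the key so that any URL maps to a safe file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, name + ".html")

    def get(self, key: str) -> Optional[str]:
        try:
            with open(self._path(key), encoding="utf-8") as handle:
                return handle.read()
        except FileNotFoundError:
            return None

    def set(self, key: str, value: str) -> None:
        with open(self._path(key), "w", encoding="utf-8") as handle:
            handle.write(value)


def cached_render(cache: FileCache, url: str, render: Callable[[str], str]) -> str:
    """Serve from the cache, rendering and storing the page only on a miss."""
    page = cache.get(url)
    if page is None:
        page = render(url)
        cache.set(url, page)
    return page


def prime(cache: FileCache, urls: Iterable[str], render: Callable[[str], str]) -> None:
    """Warm the cache for the busiest URLs before opening the site to visitors."""
    for url in urls:
        cached_render(cache, url, render)
```

Priming matters because every miss runs the expensive `render` call. If the site opens with an empty cache, many visitors trigger that call for the same page at once.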
Wonder of wonders, miracle of miracles, the site stayed alive. It was a little slow, but it was going. We kept users with Drudge in their referrer going to the static site. Drudge’s traffic is overwhelming — it dwarfed all of the other referrers, which are not small. With this setup, we managed to stay alive for most of the rest of the time.
On Friday, we hit the front page of Digg, Fox News and Drudge at the same time for our follow-up stories. I made static pages for each of those stories, and routed all traffic just to those specific URLs to the static pages. With those optimizations, we made it through the week, and over the weekend traffic subsided to manageable levels.
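Routing a handful of hot URLs to static copies amounts to a lookup table consulted before the application does any real work. The Python sketch below shows the idea; the article paths and static file names are invented for the example and are not our real URLs.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from popular story URLs to their static snapshots.
STATIC_OVERRIDES = {
    "/articles/view/1": "/static/story-1.html",
    "/articles/view/2": "/static/story-2.html",
}


def resolve(url: str) -> str:
    """Return the static snapshot for a hot story, or the original path otherwise."""
    # Ignore the query string and any trailing slash when matching.
    path = urlsplit(url).path.rstrip("/") or "/"
    return STATIC_OVERRIDES.get(path, path)
```

Unlike a referrer check, this catches every visitor to those stories no matter which site sent them, which helps when several large referrers link to the same pages at once.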
Going forward, we are implementing some changes very soon. First of all, we are going to move from our dedicated host to a virtual private server. We’ll be able to rid ourselves of cPanel, which is a waste of resources if you can manage a LAMP server adequately yourself. Also, we’ll be able to resize and get more resources in a matter of minutes, rather than hours, in case we get another spike. Additionally, we’re considering moving some of our static files (CSS and background images, JavaScript files) to Amazon S3. That will result in faster downloads for our visitors, and Apache won’t have to serve as many requests. Even though Apache is fast with static files, it can only help.
So that’s what happened. As we implement some further changes, I will blog more about them. The goal is to be able to survive a Drudging or Digging or any major linkage. We are close to getting there, but we have some work to do.
5 Comments
April 25th, 2008 at 1:17 pm
Just a word of warning - if you’re planning to move to Slicehost, you will not be able to get extra capacity “in a matter of minutes”. I resized a slice recently from 256 to 512, with around 5 GB of data on it. It took 30 minutes to prepare!
If your Apache processes are taking over your memory, you should take a look at your MaxClients in your httpd.conf.
Let me know how S3 turns out. It’s all the rage to move your statics to S3 to make it faster for the *user* - but I wonder if it reduces the load on Apache significantly as well.
April 26th, 2008 at 2:31 pm
Neodude - didn’t know that about Slicehost, I was under the impression it was faster. Oh well, 30 minutes is still pretty fast, and the site is still up during that time right?
I will take a look at that setting. I’m trying not to mess with the current server that much, because we’re switching fairly soon.
I’ll definitely post about S3 when we move to it. I feel like not having to serve ~20 static files would have to help server load.
May 3rd, 2008 at 1:52 pm
The site stays up during the resize; I think it only goes down for a couple of minutes, basically for a reboot.
I’m Thomas, btw, but I can’t figure out how to get wordpress to show my full name.
May 3rd, 2008 at 2:03 pm
I made it so that Thomas will display - not sure why you couldn’t do that - maybe you could fix it during GSoC?
July 24th, 2008 at 3:49 pm
[...] caching. Before, I really only had experience with small-scale Web sites. The YDN operates on a slightly different scale. In order to handle that kind of traffic, we’ve had to work very hard to implement many [...]