Thursday, July 24th, 2008...3:49 pm
Caching: An Introduction
By: Robert Baskin (Online Director)
Well, it’s finally time to talk about how we use caching here at the Yale Daily News. I initially promised a single post, but this first one was getting so long that it will now be a series of at least two posts, and maybe more! Allllright!
When I started making the Yale Daily News site, I had very little experience working with caching. Before, I really only had experience with small-scale Web sites. The YDN operates on a slightly different scale. In order to handle that kind of traffic, we’ve had to work very hard to implement many performance enhancements. This post describes how caching has saved our site numerous times.
Our performance problems began in February 2007, just a few short weeks after the current version of the site launched. I naively thought that bandwidth and disk space were the only metrics that mattered. That’s not the case! CPU usage and memory are very important to handling lots of visitors to your site! That may sound extremely obvious, but if you’ve only worked on shared hosts before, you might not think about that. We were on Dreamhost, which I had used successfully for many smaller sites. But the YDN site is large and complex, and generating each page is resource-intensive. We were Drudged in early February, and Dreamhost notified us that we had outgrown our account and needed to move.
I then decided to order a dedicated server from Softlayer. We moved to that server, which was much more powerful, and ran acceptably there. Our load was still fairly high, hovering between 1.5 and 2, but we weren’t crashing. However, as we added more features (and thus more complexity) to our pages, performance became a more serious problem. The traffic our China blog attracted in May 2007 exposed these problems even further, and I had to reboot our server more than once after excessive traffic made it crash. We needed caching badly.
Summer finally arrived, and though I was busy with an internship, I had some time to turn my attention to caching. I actually attended a SuperHappyDevHouse event in late June 2007 (I’ve never felt so geeky in my life), and while sitting there watching people hack in Lisp and OCaml, I came up with the initial caching implementation for the Yale Daily News.
At first, I implemented data caching. At the time, we were running every SQL query on every page. While each query was often very fast, there were frequently quite a few of them. (Such is the downside of using an ORM like the one CakePHP provides for us – it makes development much easier, but it often does more work than you need in order to accommodate all the use cases a framework has to support.) Additionally, some queries we run (for example, getting the Most Popular stories or Frequent Coauthors) are expensive and time-consuming. My goal was to cache the results of those queries.
What does that mean? Let’s take the home page as an example, as it’s the most-visited page, one of the most complex, and the one where I started. There is one main data call to the model that returns an array of all the data we need to generate the page. (A model is part of the MVC pattern and is in charge of modeling data, often a table in a database. If you’re not familiar with that, just understand that we called a function that queried the database for all of the data we need, such as the issue, articles, authors and photos, and returned an array of all of it.) The goal is to run all of that code and those SQL queries once, save the result, and then use the cache for subsequent views instead of re-running the complex and time-intensive code that generates the results. How does that work? Let’s check out some example code:
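// Reuse the cached home page data if it exists.
if ($cacheHelper->isset('indexData')) {
    $data = $cacheHelper->get('indexData');
}
else {
    // Cache miss: run the expensive queries once and save the result.
    $data = $model->getData();
    $cacheHelper->set('indexData', $data);
}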
Let’s break that down line by line. $cacheHelper is an instance of a class I created to wrap all of the necessary caching logic. It has a function called isset(), which checks whether a given cache object exists. At first, we were using file system caching, meaning we stored each cache object as a file in a directory. For this example, we stored the data in a file called “indexData”, so isset() would check whether there was a file called “indexData” in the cache directory. If there was, the cacheHelper’s get() function would read that file, and the code would store the result in $data. If the file didn’t exist, the model would run its expensive getData() function, and then the cacheHelper’s set() function would create the file indexData and store the data in it. The next time the code ran, that file would exist, and the expensive $model->getData() line would never be reached.
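To make that flow concrete, here is a minimal sketch of what a file-backed helper along those lines could look like. It is not our actual class: FileCacheHelper, buildIndexData() and the temp directory are made up for illustration, and the existence check is named has() instead of isset() because isset is a reserved word in PHP and is only accepted as a method name from PHP 7 onward. It also assumes cache keys are simple names that are safe to use as file names.

```php
<?php
// Hypothetical file-backed cache helper; a sketch, not the YDN's actual class.
class FileCacheHelper
{
    private $dir;

    public function __construct($dir)
    {
        $this->dir = rtrim($dir, '/');
        if (!is_dir($this->dir)) {
            mkdir($this->dir, 0777, true); // create the cache directory on first use
        }
    }

    // One file per cache object, named after the key (assumes keys are simple names).
    private function path($key)
    {
        return $this->dir . '/' . basename($key);
    }

    // Plays the role of isset() in the post: does the cache file exist?
    public function has($key)
    {
        return is_file($this->path($key));
    }

    // Read the cache file and rebuild the stored value.
    public function get($key)
    {
        return unserialize(file_get_contents($this->path($key)));
    }

    // Write the value to the cache file, locking it while writing.
    public function set($key, $data)
    {
        file_put_contents($this->path($key), serialize($data), LOCK_EX);
    }
}

// Stand-in for the expensive $model->getData() call; it counts how often it runs.
function buildIndexData(&$calls)
{
    $calls++;
    return array('issue' => 1, 'articles' => array('first', 'second'));
}

$cacheHelper = new FileCacheHelper(sys_get_temp_dir() . '/cache-sketch-' . uniqid());
$calls = 0;

// Simulate three page views: only the first should hit the "database".
for ($view = 0; $view < 3; $view++) {
    if ($cacheHelper->has('indexData')) {
        $data = $cacheHelper->get('indexData');
    } else {
        $data = buildIndexData($calls);
        $cacheHelper->set('indexData', $data);
    }
}

echo "Expensive call ran $calls time(s) for 3 views\n";
```

Note that nothing in this sketch ever removes or refreshes a cache file, so the first result is served forever.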
What exactly is stored in the cache file? We store the result of PHP’s serialize() function in the cache object. When we read the cache object, we call unserialize() on the data. 99 percent of the time, we store an array, which is how our database data is returned to us by the model.
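As a quick illustration of that round trip, the sketch below serializes a nested array and rebuilds it. The array is made-up sample data shaped roughly like a model result, not real YDN content.

```php
<?php
// Made-up sample data shaped like a model result; the field names are hypothetical.
$data = array(
    'issue' => array('id' => 7, 'date' => '2008-07-24'),
    'articles' => array(
        array('title' => 'Example headline', 'authors' => array('A. Writer')),
    ),
);

$stored = serialize($data);        // a plain string that can be written to a file
$restored = unserialize($stored);  // rebuilds the nested array from that string

var_dump(is_string($stored));      // bool(true)
var_dump($restored === $data);     // bool(true): same keys, values and types
```

One thing to keep in mind is that unserialize() returns false when it cannot parse the string, so a corrupt cache file looks the same as a cached value of false unless you check for it separately.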
We put this caching logic into several important places on the site (the index page, the article page, the author page, etc.) and saw massive improvements! Now, instead of running all those queries and sometimes taking seconds (years in Web server time), generating a page took well under a second! Load dropped dramatically! Major improvements!
This is the first in the caching series. It got long so I decided to split it up. Look for future posts about the following:
• Wrapping that code up above into one function and how that exposes PHP’s lameness
• How we handle expiring caches
• Changing from file system caching to memory caching
• View caching – the mother lode