24 June 2008
One of the biggest issues with media websites is dealing with high amounts of traffic. Media is all about getting eyeballs, but if you get too many hits at once, slow performance can cause problems and damage your reputation. This problem is exacerbated by the bursty nature of this web traffic. You can be cruising along at a manageable rate, then get hit with a big news story which causes a big spike. One of our clients have seen spikes of two orders of magnitude in a matter of a couple of minutes.
The general solution in computing to speed up access to the same information is to use caches. If you keep requesting my home page the web server will build up a cache in memory so repeated requests avoid touching the disk.
It's easy to keep a cache for my website, because this page, like my entire site, is entirely static. Most media sites, however, contain a lot of dynamic content. You might not think there's much business logic on your average newspaper website, but once you start looking at advertising links, related stories, special features and the like, things get a good bit more interesting. A travel story to France might link to articles on french food, and advertising that knows that a web browser in Canada is interested in a holiday in the Loire Valley. Personalization makes this even worse, my personalized preferences should generate a personalized feature list on heavy red wines. Such logic is complex in its own right, it makes for a lot of computation with each request, and crucially it ruins most caching strategies.
The way to deal with this is to divide a page up into segments where each segment has a similar determination of freshness. The article on Loire travel can be relatively static, changing only to correct errors. A related article list which feeds off tags for "France" and "Loire" will change more often, but maybe only every few days. If we arrange this properly a request for a page with these two items may be able to gather everything from caches.
The most common way of doing this that I've seen is to form caches on the web server and assemble the page segments when the page gets hit. Tools like Sitemesh are a good option for this approach. As you write the page for 18th century loire delights, you include call-outs for sections like related articles. When you get the actual web request the web server takes the page and assembles the page from the separate pieces. Much of this can be cached in the web server, which avoids hitting the back-end domain logic and database.
An interesting possibility is to go even further and use the many caches that exist in the web itself. Most calls for this web page don't even reach my web server since my page gets cached many times along the way. If you build a web page dynamically and assemble it on the server, you have to take the hit to deliver the page. An alternative is to assemble the page on the client and then draw each segment from its own URL. Each segment could be cached in different places with different caching policies.
How might this work? We might store the static article content as
XHTML at an URL like
http://gallifreyTimes/travel/18-century-loire-delights. Inside that
file we want to insert some related articles by looking up articles
tagged with "loire" and "france". In the static page we put in a
simple "a" tag.
<a class = "relatedLinks" href = "relatedLinks/france+loire">Related Links</a>
in a separate library file. When we download the Loire article the
class: in this case an "a" element with the "relatedLinks"
class. (The behavior
library is a good way to do this.) When it finds the element it
uses the information in the element to synthesize an URL for that
segment. In this case it would use what's in the element's href
attribute to come up with an URL like
it's got that URL it then gets the content and uses it to
replace the original "a" element. Since the related articles list is
handled through an URL, other gets on that URL cause caches through
the Internet to warm up, so there's a good chance that retrieving
the page may never cause a hit on the original server.
In the context of caching, the value is that each progressive
enhancement weaves in a lump of HTML with different freshness
rules. The original page is static, the related links change daily,
but both can be cached independently and weaved together. I can do
all sorts of additional elements, as long as I take care to keep
segment the page by the freshness rules. So I could include a
personalized weather forecast based on the user's profile to every
session, using it to construct an URL like
retrieving the content (which would often be cached on my hard
drive) and weaving it into the page.