A quick overview of HTTP caching

I was making a series of micro-edits to my site design today. I saved about 10 times, hard-refreshing in an incognito window after each save, and the changes showed up immediately.

Then, after about 10 saves, the changes suddenly stopped showing up.

I tried a couple more times and still no dice. Having worked with both Varnish and NGINX as caching reverse proxies, this felt eerily familiar. Most likely WordPress.com enables caching using some rule that I’m not privy to at this time.

Easily confirmed.

So if hard refreshes don’t work, what can you do? Simple: append a bogus query string to the page URL. It sounds weird until you know how HTTP caches work.
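If you want to script the trick, here is a minimal Python sketch. The function name `add_cache_buster` and the parameter name `cb` are my own placeholders, not anything a cache server expects; any query string the cache has not seen before will do.

```python
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, param="cb", token=None):
    """Return url with an extra query parameter appended.

    `param` and `token` are illustrative; a random token is used by default
    so every call produces a URL the cache has not stored yet.
    """
    parts = urlsplit(url)
    # Keep any query parameters the URL already has.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, token or uuid.uuid4().hex))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_cache_buster("https://hankbeaver.com/", token="hank"))
# https://hankbeaver.com/?cb=hank
```

Called without `token`, each call yields a different URL, which is handy when you are refreshing repeatedly while testing.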

How do HTTP cache servers cache your page?

Your URL, including the query string, becomes a unique ID that is used to look up content for that site for a period of time. Specifically, cache servers often run the URL through a hash function and use the result as the ID, rather than the URL itself. There are deeper technical reasons for this; efficiency is one, since a hash gives a fixed-length key no matter how long the URL is.

Examples

GET https://hankbeaver.com/ -> 99aa13ac0e2c10b2ab4e15f3da1f8885

GET https://hankbeaver.com/?hank -> 36efba57e6d91a28e51bc762ef51d382

Each URL produces a different MD5 hash.

These are completely different IDs to the caching server. So when I request https://hankbeaver.com/?hank, the cache server can’t find a cached copy of the page, goes to the backend servers, and serves up a fresh page. Voilà! I can now bypass caching for my tests.
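Here is a rough Python sketch of that lookup. It assumes the key is just the MD5 of the full URL and that the cache is a plain dictionary; real cache servers usually mix more into the key (request method, host, and so on), so the digests it produces will not necessarily match the ones shown above.

```python
import hashlib


def cache_key(url):
    """Hash the full URL (query string included) into a fixed-length hex ID."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical in-memory store: cache key -> page body.
cache = {}


def lookup(url):
    """Return the cached body for url, or None on a cache miss."""
    return cache.get(cache_key(url))


# Pretend the home page was cached before my latest edits.
cache[cache_key("https://hankbeaver.com/")] = "<html>stale page</html>"

print(lookup("https://hankbeaver.com/"))       # the stale cached copy
print(lookup("https://hankbeaver.com/?hank"))  # None: a miss, so the backend gets asked
```

One extra character in the query string is enough to produce an unrelated key, which is why the bogus query string skips the cached copy.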

Deep in the innards of a caching server, this ID might be broken up into a series of directories and subdirectories, like:

/99/aa/13/ac/0e/2c/.../85.html
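As a sketch of how an ID could be turned into a nested path like that, here is a small Python function. The function name, the default of three directory levels, and the `.html` extension are assumptions for illustration; actual cache servers each have their own layout rules.

```python
def cache_path(key, levels=3, width=2, ext=".html"):
    """Split the first `levels` chunks of `key` into directories.

    The remainder of the key becomes the file name. Spreading files across
    directories keeps any single directory from growing too large.
    """
    dirs = [key[i * width:(i + 1) * width] for i in range(levels)]
    filename = key[levels * width:] + ext
    return "/" + "/".join(dirs + [filename])


print(cache_path("99aa13ac0e2c10b2ab4e15f3da1f8885"))
# /99/aa/13/ac0e2c10b2ab4e15f3da1f8885.html
```

Raising `levels` produces deeper nesting, at the cost of more directories to walk.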

Regardless, by the time I was writing this section of the post, https://hankbeaver.com served the updated content.

Why did it work after a few minutes?

The page was updated because of what is called a TTL, or time to live, a concept used in cache servers and other computer systems. Most likely it was set to something like 300 seconds (5 minutes). Once that time expires, the next request gets a fresh page from the backend server.
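To make the TTL idea concrete, here is a minimal Python sketch of a cache that expires entries. The class name `TTLCache` and the `fetch` callback (standing in for a trip to the backend) are hypothetical, and the 300-second default is only my guess from above, not a known WordPress.com setting.

```python
import time


class TTLCache:
    """Tiny time-to-live cache: entries are served until they expire."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self.store = {}     # key -> (expires_at, body)

    def get(self, key, fetch):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: serve the cached copy
        body = fetch()       # missing or expired: go to the backend
        self.store[key] = (now + self.ttl, body)
        return body


# Usage with a fake clock so five minutes pass instantly.
fake_now = [0.0]
backend_hits = []


def fetch_page():
    backend_hits.append(1)
    return f"page version {len(backend_hits)}"


cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])
print(cache.get("/", fetch_page))  # page version 1 (fetched from the backend)
fake_now[0] = 120
print(cache.get("/", fetch_page))  # page version 1 (served from cache)
fake_now[0] = 300
print(cache.get("/", fetch_page))  # page version 2 (TTL expired, fetched again)
```

However many requests arrive inside the TTL window, the backend is only asked once per window.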

Why cache at all?

One word: speed. Caching a page can make an immense difference in response times and, sometimes even more importantly, keep your actual servers and databases from being overworked when a site is under heavy load. Popular news, media, and social media sites all use some or a lot of caching.

When/What should I cache?

Anything that rarely changes, be it HTML, CSS, or JavaScript, is highly cacheable. Great examples are images, CSS, and JavaScript that only change with each release. These are perfect use cases for caching and should be cached. They are often referred to as “assets” or “media”.

Real-world Use Case — Realtime application using Twitter Firehose*

*The Twitter Firehose allows access to get EVERY tweet on the platform in near-realtime via a special API key.

Problem:

Create a social media, near-realtime analytics app that visually displays sentiments and data on special tags (that could be updated as the event progressed) as well as certain user account tweets. The application was expected to get upwards of 200,000 uniques over the weekend. So optimizing was incredibly important. This was Big Data, before it was called that.

Solution:

For this project, my focus was architecting something that would be highly cacheable and not require a horde of servers. We pulled it off with ~3 servers on top of Engine Yard as the front web tier, using a CDN for assets and the Flash app.

The other 3 devs (2 backend and one Flash dev) would take the Twitter Firehose and parse the data at Rackspace Cloud (using a specialized search engine I can’t recall, not Solr), which served as the Firehose receiver. The director of the team also worked intensely with the client to tackle and block (in near-realtime) any business obstacles we faced. Then the Firehose receiver would shove a huge chunk of data into a Heroku/Redis Ruby app, which served as the backend analytics. Lastly, the backend analytics app would push the final data into the Engine Yard Rails app. All of this occurred every minute or so.

The distinct separation of the apps was intentional and allowed each dev to focus on one area and just move data forward. In the end, data was ingested by a CDN-hosted Flash app (that was slick), and the whole thing was definitely on the forefront of real-time social media at the time, ~2010.

Where caching was critical.

The Rails app at Engine Yard had NGINX as its reverse proxy and caching server. When data was updated, this application had to respond to clients (which were constantly hitting the server) with an updated index page, which in turn updated the client Flash application. These responses had a TTL of about 1 minute. This meant that, across a small set of servers, only a couple of requests per minute had to reach the Rails app and go to the DB for the latest data, even if I had 1000s of concurrent requests. The rest of the time it was practically a static page.

The tools don’t matter, the design does. Caching is a tool. 

I’d like to point out, in this instance, Ruby on Rails can scale and did. Using caching, I built a quick app in a few hours (because this is what Rails excels at) that could be used to drive the entire client app in near-realtime.

Incidentally, this was probably the most intensive project I’ve ever worked on: it was a web application dev bootcamp. 5-hour Energy drinks were definitely in play here. I’m still friends with some of these folks and have gone on to work with them.

When should I not cache?

My general opinion is that once you have built your web application/site, you should optimize everything else before considering caching, assets excluded. Do not go straight for caching. Caching too early can mask problems in your code, database queries, and more. Those problems may surface later down the road, and you will find yourself with a worse problem.

Where I work, at PrimeRevenue, caching does not offer a lot of benefits for server requests. Each request is based on a session and gets a specialized response scoped to a login and its data. We also have low concurrency. This is what I call low cacheability.

Similarly, your shopping cart at Amazon will likely not be 100% cached. The response is just for you. I could be wrong, just guessing.

How can I tell if I’m being cached?

Normally, I’d say look at status codes. A 304 (Not Modified) status code means the resource has not changed since the copy your browser already has, so the server sends no body. However, I tested a few large sites like Twitter, Medium, and Facebook, and they always sent a 200 status code regardless. Still, it was clear that response times dropped on most of the follow-up requests if I didn’t do a hard refresh. So it’s hard to say for certain whether you are being cached. You have to deduce it.
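Response headers are another place to look for clues. The Python sketch below pulls out the headers that tend to hint at caching. `Age` and `Cache-Control` are standard HTTP headers; `X-Cache` and `CF-Cache-Status` are non-standard headers that some caches and CDNs add, and I am only assuming a given site sends them. The function names are mine.

```python
from urllib.request import Request, urlopen

# Non-standard headers that some caches/CDNs add (an assumption; many sites send neither).
NONSTANDARD_CACHE_HEADERS = ("x-cache", "cf-cache-status")


def cache_hints(headers):
    """Return the subset of response headers that suggest a cache was involved."""
    lowered = {name.lower(): value for name, value in headers.items()}
    hints = {}
    # Age: seconds the response has been sitting in a cache.
    # Cache-Control: what the origin allows caches to do.
    for name in ("age", "cache-control") + NONSTANDARD_CACHE_HEADERS:
        if name in lowered:
            hints[name] = lowered[name]
    return hints


def fetch_headers(url):
    """Issue a HEAD request and return the response headers as a dict."""
    with urlopen(Request(url, method="HEAD")) as response:
        return dict(response.headers.items())


# Example with canned headers, so nothing here touches the network:
print(cache_hints({"Age": "42", "X-Cache": "HIT", "Content-Type": "text/html"}))
# {'age': '42', 'x-cache': 'HIT'}
```

For a live check, `cache_hints(fetch_headers("https://hankbeaver.com/"))` should work, but an empty result does not prove the page is uncached, since a cache is free to stay silent.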

Conclusion

I hope you found this small overview of HTTP caching helpful. The next time a site is showing you ‘stale’ content, just add a random query string. You might be happy to see that the response is no longer stale.