Ticket #540 (closed story: fixed)
Implement caching in a systematic manner
Reported by: | rgrp | Owned by: | wwaites |
---|---|---|---|
Priority: | awaiting triage | Milestone: | |
Component: | ckan | Keywords: | |
Cc: | Repository: | ckan | |
Theme: | none |
Description
Change History
Note: See
TracTickets for help on using
tickets.
Cut-and-paste from ckan-discuss:
I had a look at Varnish and I agree that the configuration language is complicated. In fact by default Varnish disregards cache control headers and in general behaves in a very standards non-compliant way. I have no doubt that it is very fast -- if you are willing to spend the efford to customise its configuration for the exact layout of pages and headers and such that each web site it is going to be used with will use. In other words, there is a large administrative burden.
So I decided to change tack and see where the Squid proxy has gotten to in the decade or so since I last met it. Squid is a general purpose caching proxy that can be configured as an http accelerator. The configuration is simple. You tell it where your web servers are for which sites. The web servers make sure to set the cache control headers appropriately.
Here are some results from my testing, against http://de.ckan.net/package/list?page=B which is an example of a slow page. Except for the first, which only did 100 requests, the tests were set to 8 simultaneous connections and a total of 1000 requests.
The results are clear. Using the application cache is about 100 times faster than doing nothing. Using squid is about 1000 times faster. (Doing both wouldn't necessarily help very much).
I'm sure we could squeeze a bit more performance out of it if we used Varnish, but probably not an order of magnitude and I don't think it is worth the administrative burden.
If we set up a production Squid instance (or farm), with a bare minimum of work it can cache for any number of sites, not just CKAN.
For the python coders, here's what you have to do to set the headers properly so that squid will cache the page:
A further advantage is that the *browsers* will also understand these cache-control headers and do their own caching - just setting them properly without even using Squid should result in some subjective performance improvements.
That's all for now, I suggest we dedicate a machine to just running squid, the more RAM the better and big discs are good, and put it between the world and the ckans. Oh, and comb through the controllers setting the headers correctly where appropriate...