Ticket #540 (closed story: fixed)

Opened 4 years ago

Last modified 3 years ago

Implement caching in a systematic manner

Reported by: rgrp Owned by: wwaites
Priority: awaiting triage Milestone:
Component: ckan Keywords:
Cc: Repository: ckan
Theme: none


Change History

comment:1 Changed 4 years ago by wwaites

Cut-and-paste from ckan-discuss:

I had a look at Varnish and I agree that the configuration language is complicated. In fact, by default Varnish disregards cache control headers and in general behaves in a very standards non-compliant way. I have no doubt that it is very fast, if you are willing to spend the effort to customise its configuration for the exact layout of pages and headers of each web site it will be used with. In other words, there is a large administrative burden.

So I decided to change tack and see where the Squid proxy has gotten to in the decade or so since I last met it. Squid is a general purpose caching proxy that can be configured as an http accelerator. The configuration is simple. You tell it where your web servers are for which sites. The web servers make sure to set the cache control headers appropriately.

Here are some results from my testing, against http://de.ckan.net/package/list?page=B which is an example of a slow page. Except for the first, which only did 100 requests, the tests were set to 8 simultaneous connections and a total of 1000 requests.

No caching of any kind:
    Requests per second:    0.44 [#/sec] (mean)

Beaker Cache (filesystem):
    Requests per second:    43.16 [#/sec] (mean)

Squid, with cache control headers set correctly:
    Requests per second:    421.33 [#/sec] (mean)
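
For anyone wanting to repeat a measurement of this shape (8 simultaneous connections, 1000 requests in total), here is a rough Python sketch. It is not the tool that produced the numbers above; the names requests_per_second and fetch_url are made up for illustration, and a thread pool only approximates what a dedicated load-testing tool does.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def requests_per_second(fetch, total=1000, concurrency=8):
    """Call fetch() `total` times using `concurrency` worker threads
    and return the mean number of completed calls per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces every call to finish before the clock is stopped.
        list(pool.map(lambda _: fetch(), range(total)))
    elapsed = time.perf_counter() - start
    return total / elapsed


def fetch_url(url):
    """Hypothetical helper: download one page and discard the body."""
    with urllib.request.urlopen(url) as resp:
        resp.read()
```

Something like requests_per_second(lambda: fetch_url(page_url)) would then give a figure comparable in kind, though not in precision, to the ones listed above.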

The results are clear. Using the application cache is about 100 times faster than doing nothing. Using Squid is about 1000 times faster. (Doing both wouldn't necessarily help very much.)
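
As a quick check on those round figures, the ratios can be worked out directly from the three measurements above. The variable and function names here are just for illustration.

```python
# Requests per second, taken from the three test runs listed above.
no_cache = 0.44
beaker_cache = 43.16
squid_cache = 421.33


def speedup(rate, baseline):
    """Return how many times faster `rate` is than `baseline`."""
    return rate / baseline


print(round(speedup(beaker_cache, no_cache)))        # 98
print(round(speedup(squid_cache, no_cache)))         # 958
print(round(speedup(squid_cache, beaker_cache), 1))  # 9.8
```

So Squid comes out roughly ten times faster than the Beaker filesystem cache on this page.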

I'm sure we could squeeze a bit more performance out of it if we used Varnish, but probably not an order of magnitude and I don't think it is worth the administrative burden.

If we set up a production Squid instance (or farm), with a bare minimum of work it can cache for any number of sites, not just CKAN.

For the Python coders, here's what you have to do to set the headers properly so that Squid will cache the page:

       from time import gmtime, strftime

       # Remove any Pragma / Cache-Control headers already set on the response.
       del response.headers["Pragma"]
       del response.headers["Cache-Control"]
       # Stamp the response with the current time in HTTP date format.
       response.headers["Last-Modified"] = strftime("%a, %d %b %Y %H:%M:%S GMT", gmtime())

A further advantage is that the *browsers* will also understand these cache-control headers and do their own caching - just setting them properly without even using Squid should result in some subjective performance improvements.
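
When combing through many controllers, the header handling could be wrapped in one small helper. The following is only a sketch under assumptions: make_cacheable is a hypothetical name, a plain dict stands in for the framework's response.headers object, and the explicit "public, max-age" value is an addition that is not in the snippet above, with 300 seconds chosen arbitrarily.

```python
from email.utils import formatdate


def make_cacheable(headers, max_age=300):
    """Adjust a dict-like headers object so caches may store the response.

    max_age (seconds) is an assumed value, not one taken from this ticket.
    """
    # pop() with a default avoids a KeyError when a header is absent.
    headers.pop("Pragma", None)
    headers.pop("Cache-Control", None)
    # formatdate(usegmt=True) yields an HTTP date such as
    # "Mon, 01 Jan 2024 00:00:00 GMT" for the current time.
    headers["Last-Modified"] = formatdate(usegmt=True)
    # Assumption: state the freshness lifetime explicitly for proxies and browsers.
    headers["Cache-Control"] = "public, max-age=%d" % max_age
    return headers


example = make_cacheable({"Pragma": "no-cache", "Cache-Control": "no-cache"})
```

Leaving out the explicit Cache-Control line gives the same behaviour as the snippet above, where the cache has to judge freshness from Last-Modified alone.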

That's all for now. I suggest we dedicate a machine to just running Squid (the more RAM the better, and big discs are good) and put it between the world and the CKAN instances. Oh, and comb through the controllers, setting the headers correctly where appropriate...

comment:2 Changed 3 years ago by dread

  • Priority set to awaiting triage
  • Component set to ckan

comment:3 Changed 3 years ago by dread

  • Repository set to ckan
  • Status changed from new to closed
  • Theme set to none
  • Resolution set to fixed

Closing - all the suggestions have been implemented: squid instance and cache headers set for high traffic pages.
