Ticket #1641 (closed enhancement: fixed)
ckanext-archiver: Content-length header not reliable to check if resource has been modified
Reported by: | amercader | Owned by: | amercader |
---|---|---|---|
Priority: | minor | Milestone: | ckan-sprint-2012-01-23 |
Component: | ckan | Keywords: | storage, archiver |
Cc: | johnglover, kindly | Repository: | ckan |
Theme: | none |
Description
The download task in ckanext-archiver performs a HEAD request on the resource URL and checks if the "Content-Type" and "Content-Length" headers differ from the values stored to see if the resource needs to be updated [1].
The "Content-Length" header, although widely used, is not mandatory and some servers don't provide it, e.g.:
    $ curl -I http://portfolio.theglobalfund.org/en/IATI/Activities?countryCode=AFG
    HTTP/1.1 200 OK
    Cache-Control: private
    Transfer-Encoding: chunked
    Content-Type: text/xml
    Vary: Accept-Encoding
    Server: Microsoft-IIS/7.5
    Set-Cookie: ASP.NET_SessionId=3qhqekddgmre0kmk5cynq0sy; path=/; HttpOnly
    X-AspNetMvc-Version: 3.0
    content-disposition: attachment; filename=AFG_IATI_12012012.xml
    X-AspNet-Version: 4.0.30319
    X-Powered-By: ASP.NET
    Date: Thu, 12 Jan 2012 12:36:43 GMT
It is also worth noting that requests, the Python library that ckanext-archiver uses, sets an "Accept-Encoding: gzip" header by default. Depending on the configuration of the remote web server, this may prevent the "Content-Length" header from being sent, e.g.:
    $ curl -H "Accept-Encoding: gzip" -I http://iatistandard.org/published-temp/adb-activities.xml
    HTTP/1.1 200 OK
    Date: Thu, 12 Jan 2012 12:12:46 GMT
    Server: Apache
    Last-Modified: Mon, 28 Nov 2011 15:55:35 GMT
    Accept-Ranges: bytes
    Vary: Accept-Encoding
    Content-Encoding: gzip
    Content-Type: application/xml

    $ curl -I http://iatistandard.org/published-temp/adb-activities.xml
    HTTP/1.1 200 OK
    Date: Thu, 12 Jan 2012 11:56:23 GMT
    Server: Apache
    Last-Modified: Mon, 28 Nov 2011 15:55:35 GMT
    Accept-Ranges: bytes
    Content-Length: 2686720
    Vary: Accept-Encoding
    Content-Type: application/xml
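To illustrate the failure mode, here is a minimal sketch of a header-based check that distinguishes "changed" from "cannot tell". It is not the code in tasks.py: the function name `header_check` and the stored keys `size` and `mimetype` are assumptions made for this example only.

```python
def header_check(stored, headers):
    """Compare stored metadata with the headers of a HEAD response.

    `stored` is a hypothetical dict with "size" and "mimetype" keys.
    Returns "changed", "unchanged", or "unknown" when the server did
    not send a Content-Length header at all.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}

    length = normalised.get("content-length")
    if length is None:
        # Chunked or gzip-encoded responses may omit the header entirely.
        return "unknown"

    if int(length) != stored.get("size"):
        return "changed"
    if normalised.get("content-type") != stored.get("mimetype"):
        return "changed"
    return "unchanged"


stored = {"size": 2686720, "mimetype": "application/xml"}

# Headers as returned without "Accept-Encoding: gzip".
plain = {"Content-Length": "2686720", "Content-Type": "application/xml"}
# Headers as returned with "Accept-Encoding: gzip": no Content-Length.
gzipped = {"Content-Encoding": "gzip", "Content-Type": "application/xml"}

print(header_check(stored, plain))
print(header_check(stored, gzipped))
```

A check that treats a missing header as "nothing to compare" would never flag the second response as modified, which matches the behaviour described in this ticket.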
All this can lead to some resources never getting updated, and of course the size property of the resource not being set.
As we need to download the resource anyway, it would be better to check if the real length of the data has been modified (and store it).
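A rough sketch of that idea follows, assuming the body is available as an iterable of byte chunks. The names `measure_length` and `needs_update` are hypothetical and are not taken from the actual fix.

```python
def measure_length(chunks):
    """Return the total number of bytes across an iterable of chunks."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
    return total


def needs_update(stored_size, chunks):
    """Compare the real length of the downloaded data with the stored size.

    Returns a (changed, actual_size) tuple so the caller can store the
    measured size on the resource. A stored_size of None (never set)
    always counts as changed.
    """
    actual_size = measure_length(chunks)
    return actual_size != stored_size, actual_size


# Chunks stand in for a streamed download here.
print(needs_update(5, [b"abc", b"de"]))
print(needs_update(None, [b"abc"]))
```

With requests, the chunks could come from `response.iter_content(chunk_size=...)` on a streamed response. Note that this yields the decoded body, so the measured length may differ from a "Content-Length" value sent for a gzip-encoded transfer.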
[1] https://github.com/okfn/ckanext-archiver/blob/0a189262dca4ab5b286fb6a02b4ab8a201f639f3/ckanext/archiver/tasks.py#L72
Fixed on 77fa6483c32