id	summary	reporter	owner	description	type	status	priority	milestone	component	resolution	keywords	cc	repo	theme
1641	ckanext-archiver: Content-length header not reliable to check if resource has been modified	amercader	amercader	"The download task in ckanext-archiver performs a HEAD request on the resource URL and checks if the ""Content-Type"" and ""Content-Length"" headers differ from the values stored to see if the resource needs to be updated [1].

The ""Content-Length"" header, although widely used, is not mandatory and some servers don't provide it, e.g.:

{{{
$ curl -I http://portfolio.theglobalfund.org/en/IATI/Activities?countryCode=AFG
HTTP/1.1 200 OK
Cache-Control: private
Transfer-Encoding: chunked
Content-Type: text/xml
Vary: Accept-Encoding
Server: Microsoft-IIS/7.5
Set-Cookie: ASP.NET_SessionId=3qhqekddgmre0kmk5cynq0sy; path=/; HttpOnly
X-AspNetMvc-Version: 3.0
content-disposition: attachment; filename=AFG_IATI_12012012.xml
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 12 Jan 2012 12:36:43 GMT
}}}

Also worth noting that [http://docs.python-requests.org/ requests], the python library that uses ckanext-archiver, sets an ""Accept-Encoding: gzip"" header by default, which depending on the configuration of the remote web server, may prevent the ""Content-Length"" server from being sent, e.g.:

{{{
$ curl -H ""Accept-Encoding: gzip"" -I http://iatistandard.org/published-temp/adb-activities.xml
HTTP/1.1 200 OK
Date: Thu, 12 Jan 2012 12:12:46 GMT
Server: Apache
Last-Modified: Mon, 28 Nov 2011 15:55:35 GMT
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Type: application/xml

curl -I http://iatistandard.org/published-temp/adb-activities.xml
HTTP/1.1 200 OK
Date: Thu, 12 Jan 2012 11:56:23 GMT
Server: Apache
Last-Modified: Mon, 28 Nov 2011 15:55:35 GMT
Accept-Ranges: bytes
Content-Length: 2686720
Vary: Accept-Encoding
Content-Type: application/xml
}}}

All this can lead to some resources never getting updated, and of course the size property of the resource not being set.

As we need to download the resource anyway, it would be better to check if the real length of the data has been modified (and store it).



[1] https://github.com/okfn/ckanext-archiver/blob/0a189262dca4ab5b286fb6a02b4ab8a201f639f3/ckanext/archiver/tasks.py#L72"	enhancement	closed	minor	ckan-sprint-2012-01-23	ckan	fixed	storage, archiver	johnglover kindly	ckan	none
