<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>CKAN: Ticket #1641: ckanext-archiver: Content-length header not reliable to check if resource has been modified</title>
    <link>http://localhost/ticket/1641</link>
    <description>&lt;p&gt;
The download task in ckanext-archiver performs a HEAD request on the resource URL and checks whether the "Content-Type" and "Content-Length" headers differ from the stored values to decide whether the resource needs to be updated &lt;a class="missing changeset" title="No default repository defined"&gt;[1]&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
The "Content-Length" header, although widely used, is not mandatory and some servers don't provide it, e.g.:
&lt;/p&gt;
&lt;pre class="wiki"&gt;$ curl -I http://portfolio.theglobalfund.org/en/IATI/Activities?countryCode=AFG
HTTP/1.1 200 OK
Cache-Control: private
Transfer-Encoding: chunked
Content-Type: text/xml
Vary: Accept-Encoding
Server: Microsoft-IIS/7.5
Set-Cookie: ASP.NET_SessionId=3qhqekddgmre0kmk5cynq0sy; path=/; HttpOnly
X-AspNetMvc-Version: 3.0
content-disposition: attachment; filename=AFG_IATI_12012012.xml
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 12 Jan 2012 12:36:43 GMT
&lt;/pre&gt;&lt;p&gt;
Also worth noting that &lt;a class="ext-link" href="http://docs.python-requests.org/"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;requests&lt;/a&gt;, the Python library that ckanext-archiver uses, sets an "Accept-Encoding: gzip" header by default which, depending on the configuration of the remote web server, may prevent the "Content-Length" header from being sent, e.g.:
&lt;/p&gt;
&lt;pre class="wiki"&gt;$ curl -H "Accept-Encoding: gzip" -I http://iatistandard.org/published-temp/adb-activities.xml
HTTP/1.1 200 OK
Date: Thu, 12 Jan 2012 12:12:46 GMT
Server: Apache
Last-Modified: Mon, 28 Nov 2011 15:55:35 GMT
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Type: application/xml
$ curl -I http://iatistandard.org/published-temp/adb-activities.xml
HTTP/1.1 200 OK
Date: Thu, 12 Jan 2012 11:56:23 GMT
Server: Apache
Last-Modified: Mon, 28 Nov 2011 15:55:35 GMT
Accept-Ranges: bytes
Content-Length: 2686720
Vary: Accept-Encoding
Content-Type: application/xml
&lt;/pre&gt;&lt;p&gt;
All this can lead to some resources never getting updated and, when the header is missing, to the size property of the resource not being set.
&lt;/p&gt;
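&lt;p&gt;
A minimal sketch of how the header check could at least tell "unchanged" apart from "unknown". The function names declared_length and headers_say_modified are hypothetical and do not come from ckanext-archiver; the headers argument is assumed to be a plain mapping of header names to values.
&lt;/p&gt;

```python
def declared_length(headers):
    """Return the Content-Length as an int, or None if absent or unusable."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("content-length")
    if raw is None:
        return None
    raw = raw.strip()
    # Accept plain ASCII digits only; anything else is treated as unknown.
    if raw.isascii() and raw.isdigit():
        return int(raw)
    return None


def headers_say_modified(headers, stored_length):
    """Return True or False when the header allows a decision, else None."""
    length = declared_length(headers)
    if length is None or stored_length is None:
        # Unknown: the caller has to download the resource and measure it.
        return None
    return length != stored_length
```

&lt;p&gt;
With the first response above (chunked, no "Content-Length"), headers_say_modified returns None rather than reporting the resource as unchanged.
&lt;/p&gt;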
&lt;p&gt;
As we need to download the resource anyway, it would be better to check whether the actual length of the downloaded data has changed (and store that length).
&lt;/p&gt;
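&lt;p&gt;
The idea above could be sketched roughly as follows. This is an illustration only, not the code from the actual fix: measure and is_modified are hypothetical names, and the SHA-1 comparison is an extra assumption beyond the ticket, added because two different files can have the same length.
&lt;/p&gt;

```python
import hashlib


def measure(chunks):
    """Return (length_in_bytes, sha1_hexdigest) for an iterable of byte chunks."""
    digest = hashlib.sha1()
    length = 0
    for chunk in chunks:
        digest.update(chunk)
        length += len(chunk)
    return length, digest.hexdigest()


def is_modified(chunks, stored_length, stored_hash=None):
    """Compare the measured body against the stored values.

    Returns (modified, length, content_hash) so the caller can store
    the freshly measured values on the resource.
    """
    length, content_hash = measure(chunks)
    modified = length != stored_length
    # Assumption beyond the ticket: when a stored hash is available,
    # also compare hashes, since equal lengths do not imply equal content.
    if not modified and stored_hash is not None:
        modified = content_hash != stored_hash
    return modified, length, content_hash
```

&lt;p&gt;
With requests, the chunks could come from iter_content() on a streamed GET response, so the body is measured while it is being saved and no header is needed.
&lt;/p&gt;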
&lt;p&gt;
&lt;a class="missing changeset" title="No default repository defined"&gt;[1]&lt;/a&gt; &lt;a class="ext-link" href="https://github.com/okfn/ckanext-archiver/blob/0a189262dca4ab5b286fb6a02b4ab8a201f639f3/ckanext/archiver/tasks.py#L72"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://github.com/okfn/ckanext-archiver/blob/0a189262dca4ab5b286fb6a02b4ab8a201f639f3/ckanext/archiver/tasks.py#L72&lt;/a&gt;
&lt;/p&gt;
</description>
    <language>en-us</language>
    <image>
      <title>CKAN</title>
      <url>http://assets.okfn.org/p/ckan/img/ckan_logo_shortname.png</url>
      <link>http://localhost/ticket/1641</link>
    </image>
    <generator>Trac 0.12.3</generator>
    <item>
      
        <dc:creator>amercader</dc:creator>

      <pubDate>Thu, 12 Jan 2012 13:59:37 GMT</pubDate>
      <title>status changed; resolution set</title>
      <link>http://localhost/ticket/1641#comment:1</link>
      <guid isPermaLink="false">http://localhost/ticket/1641#comment:1</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;new&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;fixed&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Fixed in 77fa6483c32
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item>
 </channel>
</rss>