<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>CKAN: Ticket #891: Resource download worker daemon</title>
    <link>http://localhost/ticket/891</link>
    <description>&lt;p&gt;
Superticket: &lt;a class="closed ticket" href="http://localhost/ticket/1397" title="enhancement: [super] Resource archiving (closed: fixed)"&gt;#1397&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
Write a worker daemon to download all resources from a CKAN instance to an OFS repository.
&lt;/p&gt;
&lt;h2 id="Questions"&gt;Questions&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;Do we only want to download openly licensed information? ANS: no, we do everything (though do need to think about this re. IP issues)
&lt;/li&gt;&lt;li&gt;Should we have clever ways to dump APIs? ANS: no.
&lt;/li&gt;&lt;li&gt;Do we respect robots.txt even for openly licensed information? ANS: No (we're not crawling we're archiving)
&lt;/li&gt;&lt;li&gt;Use HTTP/1.1 Caching headers? ANS: if not changed since we last updated don't bother to recache.
&lt;ul&gt;&lt;li&gt;Complete support for ETags
&lt;/li&gt;&lt;li&gt;Expires, Max-Age etc.
&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Check
&lt;/li&gt;&lt;/ul&gt;&lt;h2 id="Functionality"&gt;Functionality&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;Download files via HTTP, HTTPS (will not do FTP)
&lt;/li&gt;&lt;li&gt;Respect robots.txt
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
Process:
&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;[Archiver.Update checks queue (automated as part of celery)]
&lt;/li&gt;&lt;li&gt;Open url and get any info from resource on cache / content-length etc
&lt;ol&gt;&lt;li&gt;If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery
&lt;/li&gt;&lt;li&gt;Check headers for content-length and content-type ...
&lt;ul&gt;&lt;li&gt;IF: content-length &amp;gt; max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?)
&lt;/li&gt;&lt;li&gt;ELSE: check content-type.
&lt;ul&gt;&lt;li&gt;IF: NOT data stuff (e.g. text/html) then EXIT. (store outcomes and info on resource)
&lt;/li&gt;&lt;li&gt;ELSE: archive it (compute md5 hash etc)
&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;IF: get hash from headers and hash unchanged GOTO step 4
&lt;/li&gt;&lt;li&gt;IF: get content-length and content-length unchanged GOTO step 4
&lt;/li&gt;&lt;li&gt;IF: max-age / expires / other cache headers show this has not changed since last check GOTO step 4
&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li&gt;Archive it: connect to storage system and store it. Bucket: from config, Key: /archive/{timestamp}/{resourceid}/filename.ext
&lt;ul&gt;&lt;li&gt;Add cache url to resource and updated date
&lt;/li&gt;&lt;li&gt;Add other relevant info to resource such as md5, content-type etc
&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Update task_status
&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;
&lt;/p&gt;
&lt;h2 id="Optionalfunctionality"&gt;Optional functionality&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;If result object is HTML, search for references to "proper data" (CSV download pages etc.)
&lt;/li&gt;&lt;li&gt;Download from POST forms (accepting licenses or weird proprietary systems)
&lt;/li&gt;&lt;li&gt;Support running on Google Apps Engine to save traffic costs.
&lt;/li&gt;&lt;/ul&gt;&lt;h2 id="Existingwork"&gt;Existing work&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;&lt;a class="ext-link" href="https://bitbucket.org/okfn/ckanext-qa/overview"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://bitbucket.org/okfn/ckanext-qa/overview&lt;/a&gt;
&lt;/li&gt;&lt;li&gt;out of date: &lt;a class="ext-link" href="https://bitbucket.org/pudo/ckanextarchive"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://bitbucket.org/pudo/ckanextarchive&lt;/a&gt; - Old archiver extension, largely experimental.
&lt;/li&gt;&lt;li&gt;out of date: &lt;a class="ext-link" href="https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65&lt;/a&gt; - Openness scores by ollyc
&lt;/li&gt;&lt;/ul&gt;</description>
    <language>en-us</language>
    <image>
      <title>CKAN</title>
      <url>http://assets.okfn.org/p/ckan/img/ckan_logo_shortname.png</url>
      <link>http://localhost/ticket/891</link>
    </image>
    <generator>Trac 0.12.3</generator>
    <item>
      
        <dc:creator>pudo</dc:creator>

      <pubDate>Mon, 03 Jan 2011 11:18:10 GMT</pubDate>
      <title>description changed</title>
      <link>http://localhost/ticket/891#comment:1</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:1</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=1"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>pudo</dc:creator>

      <pubDate>Mon, 03 Jan 2011 11:21:32 GMT</pubDate>
      <title>milestone changed</title>
      <link>http://localhost/ticket/891#comment:2</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:2</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                changed from &lt;em&gt;ckan-v1.3&lt;/em&gt; to &lt;em&gt;iati-4&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Thu, 13 Oct 2011 18:13:28 GMT</pubDate>
      <title>status, description changed; repo, theme, milestone, owner set</title>
      <link>http://localhost/ticket/891#comment:3</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:3</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;new&lt;/em&gt; to &lt;em&gt;assigned&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=3"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;repo&lt;/strong&gt;
                set to &lt;em&gt;ckan&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;theme&lt;/strong&gt;
                set to &lt;em&gt;none&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                set to &lt;em&gt;ckan-v1.6&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              set to &lt;em&gt;kindly&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Fri, 14 Oct 2011 13:46:02 GMT</pubDate>
      <title>description changed</title>
      <link>http://localhost/ticket/891#comment:4</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:4</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=4"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Fri, 14 Oct 2011 14:22:08 GMT</pubDate>
      <title>milestone changed</title>
      <link>http://localhost/ticket/891#comment:5</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:5</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                changed from &lt;em&gt;ckan-v1.6&lt;/em&gt; to &lt;em&gt;ckan-sprint-2011-10-24&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
May only do link-checker and not do full storage in this sprint.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Fri, 14 Oct 2011 14:22:55 GMT</pubDate>
      <title>description changed</title>
      <link>http://localhost/ticket/891#comment:6</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:6</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=6"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>johnglover</dc:creator>

      <pubDate>Wed, 19 Oct 2011 10:39:39 GMT</pubDate>
      <title>owner, description changed; keywords set</title>
      <link>http://localhost/ticket/891#comment:7</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:7</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;keywords&lt;/strong&gt;
              &lt;em&gt;queue&lt;/em&gt; added
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;kindly&lt;/em&gt; to &lt;em&gt;johnglover&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=7"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>johnglover</dc:creator>

      <pubDate>Tue, 01 Nov 2011 10:30:24 GMT</pubDate>
      <title>milestone changed</title>
      <link>http://localhost/ticket/891#comment:8</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:8</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                changed from &lt;em&gt;ckan-sprint-2011-10-24&lt;/em&gt; to &lt;em&gt;current-ckan-sprint-2011-11-07&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Almost finished (see &lt;a class="ext-link" href="http://github.com/okfn/ckanext-archiver"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://github.com/okfn/ckanext-archiver&lt;/a&gt;).
&lt;/p&gt;
&lt;p&gt;
Still to address:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;check headers to see if hash / cache / max-age / expires indicates that the resource does not need to be downloaded.
&lt;/li&gt;&lt;li&gt;add cache url to resource
&lt;/li&gt;&lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>johnglover</dc:creator>

      <pubDate>Tue, 01 Nov 2011 12:17:21 GMT</pubDate>
      <title>status, description changed; resolution set</title>
      <link>http://localhost/ticket/891#comment:9</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:9</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;assigned&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;fixed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=9"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Added cache_url and cache_last_updated to resources after archiving.
&lt;/p&gt;
&lt;p&gt;
Not checking for hash value in headers. This process will generally only run when a new resource is added or someone updates a URL, so we don't expect to be regularly downloading the same resource.
&lt;/p&gt;
&lt;p&gt;
We will need something along these lines if this is running as a regular cron job, but in that case the logic will be added to the cron job itself (probably a paster command).
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item>
 </channel>
</rss>