<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>CKAN: Ticket #891: Resource download worker daemon</title>
    <link>http://localhost/ticket/891</link>
    <description>&lt;p&gt;
Superticket: &lt;a class="closed ticket" href="http://localhost/ticket/1397" title="enhancement: [super] Resource archiving (closed: fixed)"&gt;#1397&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
Write a worker daemon to download all resources from a CKAN instance to an OFS repository.
&lt;/p&gt;
&lt;h2 id="Openquestions"&gt;Open questions&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;Do we only want to download openly licensed information?
&lt;/li&gt;&lt;li&gt;Should we have clever ways to dump APIs?
&lt;/li&gt;&lt;li&gt;Do we respect robots.txt even for openly licensed information?
&lt;/li&gt;&lt;/ul&gt;&lt;h2 id="Functionality"&gt;Functionality&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;Download files via HTTP, HTTPS and, optionally, FTP.
&lt;/li&gt;&lt;li&gt;Respect HTTP/1.1 caching headers:
&lt;ul&gt;&lt;li&gt;Complete support for ETags
&lt;/li&gt;&lt;li&gt;Expires, max-age, etc.
&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Handle errors, classify as temporary or permanent
&lt;/li&gt;&lt;li&gt;Respect robots.txt
&lt;/li&gt;&lt;/ul&gt;&lt;h2 id="Optionalfunctionality"&gt;Optional functionality&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;If the result object is HTML, search for references to "proper data" (CSV download pages etc.)
&lt;/li&gt;&lt;li&gt;Download from POST forms (accepting licenses or weird proprietary systems)
&lt;/li&gt;&lt;li&gt;Support running on Google App Engine to save traffic costs.
&lt;/li&gt;&lt;/ul&gt;&lt;h2 id="Existingwork"&gt;Existing work&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;&lt;a class="ext-link" href="https://bitbucket.org/pudo/ckanextarchive"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://bitbucket.org/pudo/ckanextarchive&lt;/a&gt; - Old archiver extension, largely experimental.
&lt;/li&gt;&lt;li&gt;&lt;a class="ext-link" href="https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65&lt;/a&gt; - Openness scores by ollyc
&lt;/li&gt;&lt;/ul&gt;</description>
    <language>en-us</language>
    <image>
      <title>CKAN</title>
      <url>http://assets.okfn.org/p/ckan/img/ckan_logo_shortname.png</url>
      <link>http://localhost/ticket/891</link>
    </image>
    <generator>Trac 0.12.3</generator>
    <item>
      
        <dc:creator>pudo</dc:creator>

      <pubDate>Mon, 03 Jan 2011 11:18:10 GMT</pubDate>
      <title>description changed</title>
      <link>http://localhost/ticket/891#comment:1</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:1</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=1"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>pudo</dc:creator>

      <pubDate>Mon, 03 Jan 2011 11:21:32 GMT</pubDate>
      <title>milestone changed</title>
      <link>http://localhost/ticket/891#comment:2</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:2</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                changed from &lt;em&gt;ckan-v1.3&lt;/em&gt; to &lt;em&gt;iati-4&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Thu, 13 Oct 2011 18:13:28 GMT</pubDate>
      <title>status, description changed; repo, theme, milestone, owner set</title>
      <link>http://localhost/ticket/891#comment:3</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:3</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;new&lt;/em&gt; to &lt;em&gt;assigned&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=3"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;repo&lt;/strong&gt;
                set to &lt;em&gt;ckan&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;theme&lt;/strong&gt;
                set to &lt;em&gt;none&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                set to &lt;em&gt;ckan-v1.6&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              set to &lt;em&gt;kindly&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Fri, 14 Oct 2011 13:46:02 GMT</pubDate>
      <title>description changed</title>
      <link>http://localhost/ticket/891#comment:4</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:4</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=4"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Fri, 14 Oct 2011 14:22:08 GMT</pubDate>
      <title>milestone changed</title>
      <link>http://localhost/ticket/891#comment:5</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:5</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                changed from &lt;em&gt;ckan-v1.6&lt;/em&gt; to &lt;em&gt;ckan-sprint-2011-10-24&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
May only implement the link checker, not full storage, in this sprint.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Fri, 14 Oct 2011 14:22:55 GMT</pubDate>
      <title>description changed</title>
      <link>http://localhost/ticket/891#comment:6</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:6</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=6"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>johnglover</dc:creator>

      <pubDate>Wed, 19 Oct 2011 10:39:39 GMT</pubDate>
      <title>owner, description changed; keywords set</title>
      <link>http://localhost/ticket/891#comment:7</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:7</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;keywords&lt;/strong&gt;
              &lt;em&gt;queue&lt;/em&gt; added
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;kindly&lt;/em&gt; to &lt;em&gt;johnglover&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=7"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>johnglover</dc:creator>

      <pubDate>Tue, 01 Nov 2011 10:30:24 GMT</pubDate>
      <title>milestone changed</title>
      <link>http://localhost/ticket/891#comment:8</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:8</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                changed from &lt;em&gt;ckan-sprint-2011-10-24&lt;/em&gt; to &lt;em&gt;current-ckan-sprint-2011-11-07&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Almost finished (see &lt;a class="ext-link" href="http://github.com/okfn/ckanext-archiver"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://github.com/okfn/ckanext-archiver&lt;/a&gt;).
&lt;/p&gt;
&lt;p&gt;
Still to address:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Check the response headers to see whether the hash, cache, max-age or expires values indicate that the resource does not need to be downloaded again.
&lt;/li&gt;&lt;li&gt;Add the cache URL to the resource.
&lt;/li&gt;&lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>johnglover</dc:creator>

      <pubDate>Tue, 01 Nov 2011 12:17:21 GMT</pubDate>
      <title>status, description changed; resolution set</title>
      <link>http://localhost/ticket/891#comment:9</link>
      <guid isPermaLink="false">http://localhost/ticket/891#comment:9</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;assigned&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;fixed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/891?action=diff&amp;amp;version=9"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Added cache_url and cache_last_updated to resources after archiving.
&lt;/p&gt;
&lt;p&gt;
We are not checking for a hash value in the headers. This process will generally only run when a new resource is added or someone updates a URL, so we don't expect to be regularly downloading the same resource.
&lt;/p&gt;
&lt;p&gt;
We will need something along these lines if this is running as a regular cron job, but in that case the logic will be added to the cron job itself (probably a paster command).
&lt;/p&gt;
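&lt;p&gt;
A minimal sketch of the kind of header check discussed above, should the archiver run as a regular cron job. The function names and the dict of stored, lower-cased response headers are assumptions made for illustration; they are not part of ckanext-archiver.
&lt;/p&gt;
&lt;pre class="wiki"&gt;
```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    for directive in (cache_control or "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.strip('"').isdigit():
            return int(value.strip('"'))
    return None


def is_fresh(headers, fetched_at, now):
    """True if the stored response headers say the cached copy is still valid.

    `headers` is a dict with lower-case keys saved from the last download
    (hypothetical storage; an assumption of this sketch).
    `fetched_at` and `now` are timezone-aware datetimes.
    """
    cache_control = headers.get("cache-control", "")
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-cache" in directives or "no-store" in directives:
        return False
    max_age = max_age_seconds(cache_control)
    if max_age is not None:
        # max-age takes precedence over Expires
        return fetched_at + timedelta(seconds=max_age) > now
    expires = headers.get("expires")
    if expires:
        try:
            return parsedate_to_datetime(expires) > now
        except (TypeError, ValueError):
            return False  # an unusable Expires value counts as already expired
    return False  # no freshness information: download again


def conditional_headers(headers):
    """Build request headers for a conditional GET from stored validators."""
    request = {}
    if headers.get("etag"):
        request["If-None-Match"] = headers["etag"]
    if headers.get("last-modified"):
        request["If-Modified-Since"] = headers["last-modified"]
    return request


if __name__ == "__main__":
    fetched = datetime(2011, 11, 1, 12, 0, tzinfo=timezone.utc)
    stored = {"cache-control": "max-age=3600", "etag": '"abc123"'}
    print(is_fresh(stored, fetched, fetched + timedelta(minutes=30)))
    print(conditional_headers(stored))
```
&lt;/pre&gt;
&lt;p&gt;
If is_fresh returns False, the job could send the conditional headers with the request and skip re-archiving when the server answers 304 Not Modified.
&lt;/p&gt;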
      </description>
      <category>Ticket</category>
    </item>
 </channel>
</rss>