* Do we only want to download openly licensed information? ANS: No, we download everything (though we do need to think about this with regard to IP issues).
* Should we have clever ways to dump APIs? ANS: No.
* Do we respect robots.txt even for openly licensed information? ANS: No (we're not crawling, we're archiving).
* Use HTTP/1.1 caching headers? ANS: Yes; if the resource has not changed since we last updated, don't bother to re-cache it (see the sketch after this list).
  * Complete support for ETags
  * Expires, Max-Age, etc.
  * Check
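
A minimal sketch of the caching-header check, assuming Python with the `requests` library (not necessarily what the archiver itself uses); `last_etag` and `last_modified` are hypothetical names for the validator values saved from the previous archive run:

```python
import requests

def fetch_if_changed(url, last_etag=None, last_modified=None, timeout=30):
    """Conditional GET: a 304 Not Modified answer means nothing changed
    since we last cached the resource, so we can skip re-caching it."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag          # ETag support
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # Last-Modified support

    response = requests.get(url, headers=headers, stream=True, timeout=timeout)
    if response.status_code == 304:
        return False, response   # unchanged: don't bother to re-cache
    return True, response        # changed (or no validators stored): re-archive
```

Expires / Cache-Control max-age from the previously stored response could also be checked before making any request at all, skipping the round trip entirely while the cached copy is still fresh.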
Process:

1. [Archiver.Update checks the queue (automated as part of celery)]
2. Open the URL and get any info from the resource on cache / content-length etc.
   1. If FAILURE status: update the task_status table (could retry if there have been no more than 3 failures so far). Report the task failure in celery.
   2. Check the headers for content-length and content-type (a pre-check sketch follows this list):
      * IF content-length > max_content_length: EXIT (store outcomes on task_status, and update the resource with size, content-type and any other info we get?)
      * ELSE: check content-type.
        * IF not a data format (e.g. text/html): EXIT (store outcomes and info on the resource).
        * ELSE: archive it (compute the md5 hash etc.).
      * IF we get a hash from the headers and the hash is unchanged: GOTO step 4.
      * IF we get a content-length and it is unchanged: GOTO step 4.
      * IF max-age / expires / other cache headers show it has not changed since the last check: GOTO step 4.
3. Archive it: connect to the storage system and store it (a storage sketch also follows this list). Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext
   * Add the cache URL and updated date to the resource.
   * Add other relevant info to the resource, such as md5, content-type etc.
4. Update task_status.
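
A sketch of the step-2 header pre-checks, again assuming `requests`; the `MAX_CONTENT_LENGTH` value and the allow-list of data content types are illustrative, not the project's actual configuration:

```python
import requests

MAX_CONTENT_LENGTH = 50 * 1024 * 1024                 # hypothetical size cap
DATA_CONTENT_TYPES = {"text/csv", "application/json",
                      "application/xml", "application/vnd.ms-excel"}  # hypothetical allow-list

def precheck(url, timeout=30):
    """Return (ok, info): ok is False when we should EXIT without archiving."""
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()                        # FAILURE status handled by the caller

    info = {
        "size": response.headers.get("Content-Length"),
        "content_type": response.headers.get("Content-Type", "").split(";")[0].strip(),
    }

    if info["size"] and int(info["size"]) > MAX_CONTENT_LENGTH:
        return False, info                             # too large: store outcome on task_status
    if info["content_type"] and info["content_type"] not in DATA_CONTENT_TYPES:
        return False, info                             # not data stuff (e.g. text/html)
    return True, info                                  # go on to archive it
```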
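
And a sketch of step 3, storing the downloaded file under /{timestamp}/{resourceid}/filename.ext. It assumes `requests` plus `boto3` as the storage client; the bucket name, cache URL shape and returned field names are hypothetical:

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlsplit

import boto3
import requests

def archive(url, resource_id, bucket, timeout=30):
    """Download the resource, store it, and return the info to write back onto the resource."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    body = response.content

    filename = urlsplit(url).path.rsplit("/", 1)[-1] or "data"
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    key = f"{timestamp}/{resource_id}/{filename}"      # Key: /{timestamp}/{resourceid}/filename.ext

    s3 = boto3.client("s3")                            # bucket would come from config in practice
    s3.put_object(Bucket=bucket, Key=key, Body=body)

    return {
        "cache_url": f"https://{bucket}.s3.amazonaws.com/{key}",
        "cache_last_updated": timestamp,
        "hash": hashlib.md5(body).hexdigest(),         # md5 etc.
        "content_type": response.headers.get("Content-Type"),
        "size": len(body),
    }
```

Either way the flow ends, the caller would then update task_status (step 4) with the outcome.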