Changes between Version 3 and Version 4 of Ticket #891


Timestamp:
10/14/11 13:46:02 (3 years ago)
Author:
rgrp

  • Ticket #891 – Description

    v3 v4  
    3 3 Write a worker daemon to download all resources from a CKAN instance to an OFS repository.  
    4 4 
    5 == Open questions ==  
     5 == Questions ==  
    6 6 
    7  * Do we only want to download openly licensed information?  
    8  * Should we have clever ways to dump APIs?  
    9  * Do we respect robots.txt even for openly licensed information?  
     7 * Do we only want to download openly licensed information? ANS: no, we do everything (though we do need to think about this re. IP issues) 
     8 * Should we have clever ways to dump APIs? ANS: no. 
     9 * Do we respect robots.txt even for openly licensed information? ANS: No (we're not crawling, we're archiving) 
     10 * Use HTTP/1.1 Caching headers? ANS: if the resource has not changed since we last updated, don't bother to recache.  
     11  * Complete support for ETags 
     12  * Expires, Max-Age etc.  
     13 * Check  
    10 14 
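The caching answer above maps onto HTTP conditional requests. A minimal Python sketch, assuming we stored the `ETag` and `Last-Modified` values from the previous fetch; the function names `conditional_headers` and `should_recache` are hypothetical and not part of any existing archiver code. A `304 Not Modified` response means the copy we already hold is still current.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build request headers that ask the server to reply 304 if unchanged.

    Both arguments are the values saved from the previous fetch (assumption:
    we keep them alongside the resource); either may be missing.
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def should_recache(status_code: int) -> bool:
    """Return False when the server says the resource has not changed."""
    return status_code != 304
```

If neither value was stored, the headers dict is empty and the request is an ordinary unconditional GET.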
    11 15 == Functionality ==  
    12 16 
    13  * Download files via HTTP, HTTPS and, optionally FTP.  
    14  * Respect HTTP/1.1 Caching headers:  
    15   * Complete support for ETags 
    16   * Expires, Max-Age etc.  
    17  * Handle errors, classify as temporary or permanent 
     17 * Download files via HTTP, HTTPS (will not do FTP) 
    18 18 * Respect robots.txt  
    19 19 
     20 Process: 
     21 
     22 1. [Archiver.Update checks queue (automated as part of celery)] 
     23 2. Open the URL and get any info from the resource on cache / content-length, etc. 
     24  1. If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery 
     25  2. Check headers for content-length and content-type ... 
     26    * IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?) 
     27    * ELSE: check content-type. 
     28      * IF: NOT data stuff (e.g. text/html) then EXIT. (store outcomes and info on resource) 
     29      * ELSE: archive it (compute md5 hash etc) 
     30    * IF: get hash from headers and hash unchanged GOTO step 4 
     31    * IF: get content-length and content-length unchanged GOTO step 4 
     32    * IF: max-age / expires / other cache headers show this has not changed since last check GOTO step 4 
     33 3. Archive it: connect to storage system and store it. Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext 
     34  * Add cache url to resource and updated date 
     35  * Add other relevant info to resource such as md5, content-type etc 
     36 4. Update task_status 
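
The header checks in step 2 could be sketched as a single decision function. This is only one reading of the steps, under assumptions: the size limit value, the set of non-data content types, and the names `decide` and `should_retry` are all hypothetical; the ETag is used as a stand-in for "hash from headers"; and the max-age / expires check is left out for brevity.

```python
from typing import Mapping, Optional

MAX_CONTENT_LENGTH = 50 * 1024 * 1024    # assumed limit; the ticket gives no value
NON_DATA_TYPES = frozenset({"text/html"})  # assumed; the ticket only names text/html
MAX_FAILURES = 3                         # "not more than 3 failures so far"


def should_retry(failures_so_far: int) -> bool:
    """Step 2.1: retry only while the failure count is at most MAX_FAILURES."""
    return failures_so_far <= MAX_FAILURES


def decide(headers: Mapping[str, str],
           previous: Optional[Mapping[str, str]] = None,
           max_content_length: int = MAX_CONTENT_LENGTH) -> str:
    """Step 2.2: return 'too_large', 'not_data', 'unchanged' or 'archive'.

    `headers` are the response headers of the current check; `previous`
    holds the headers recorded at the last successful archive, if any.
    """
    current = {k.lower(): v for k, v in headers.items()}
    prev = {k.lower(): v for k, v in (previous or {}).items()}

    # Too large: exit without archiving.
    length = current.get("content-length")
    if length is not None and length.isdigit() and int(length) > max_content_length:
        return "too_large"

    # Not data (e.g. an HTML page): exit without archiving.
    content_type = current.get("content-type", "").split(";")[0].strip().lower()
    if content_type in NON_DATA_TYPES:
        return "not_data"

    # Unchanged hash (ETag here) or unchanged content-length: go to step 4.
    for name in ("etag", "content-length"):
        if name in current and current[name] == prev.get(name):
            return "unchanged"

    return "archive"
```

Whatever the outcome, the caller would still record it on task_status (step 4); only the 'archive' outcome proceeds to step 3.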
     37   
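
For step 3, a rough sketch of building the storage key and computing the md5 while reading the download. Assumptions: the timestamp is passed in as an already formatted string (the ticket does not fix a format), the filename is taken from the last segment of the URL path, and `"resource"` is a made-up fallback when the URL has no filename. `storage_key` and `md5_of_stream` are hypothetical names.

```python
import hashlib
import io
import posixpath
from typing import BinaryIO
from urllib.parse import urlparse


def storage_key(timestamp: str, resource_id: str, url: str) -> str:
    """Build a key of the form /{timestamp}/{resourceid}/filename.ext."""
    filename = posixpath.basename(urlparse(url).path) or "resource"
    return f"/{timestamp}/{resource_id}/{filename}"


def md5_of_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Hash a file-like object in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(storage_key("20111014T134602", "abc123", "http://example.com/data/file.csv"))
    print(md5_of_stream(io.BytesIO(b"example bytes")))
```

The bucket itself would come from config, as the step says; it is deliberately not part of the key here.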
    20 38 == Optional functionality == 
    21 39 
     
    26 44 == Existing work ==  
    27 45 
    28  * https://bitbucket.org/pudo/ckanextarchive - Old archiver extension, largely experimental.  
    29  * https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65 - Openness scores by ollyc 
     46 * https://bitbucket.org/okfn/ckanext-qa/overview 
     47 * out of date: https://bitbucket.org/pudo/ckanextarchive - Old archiver extension, largely experimental.  
     48 * out of date: https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65 - Openness scores by ollyc