Changes between Version 2 and Version 3 of Ticket #1397

10/14/11 13:34:24 (3 years ago)


  • Ticket #1397 – Description

    v2 v3  
    11We want to cache/archive data associated with a resource so that it remains available if the resource URL disappears (and to support other processing we may wish to do, e.g. webstorer ...) 
    3 TODO: (@kindly: inline material etherpad info (or ref) from planning session this week) 
     3Etherpad: (most relevant parts inlined here) 
    5  * Resource change notifications in core - #1383 
    6  * Resource download worker daemon - #891 
    7  * Make archived data available in WUI - #892 
    8  * Introduce timed actions into ckanext-queue - #890 
     5== Preliminaries == 
     7 * Add task_status table to store qa/archiver/webstore information that does not need to be versioned. - #1363 (and #1371 - related logic functions) 
     9== Tasks == 
     11 1. Resource change notifications in core - Make an IResourceChange and IResourceUrlChange. [1d] [0.75d] -  #1383 
     12 2. ckanext-archiver implements IResourceUrlChange and sends tasks to celery. [0.25d][0.25d] - ??? 
     13 3. Archiver daemon #891 
     14   1. Implement link-check function and task (point 2 from Archiver.update above) [1d] [0.5d] 
     15   2. Rewrite archiver to use external storage. (decide how!)[3d][~2d] 
     16 5. Write to resource and task status table.[1d][0.75d] 
     17 6. Make archived data available in WUI - #892 
     19== Archiver process == 
     23 0. A resource is added to CKAN 
     24 1. IResourceCreate event generated 
     25 2. IF: resource URL points to CKAN storage or falls within some other set of exclusion conditions then END, else continue 
     26 3. Generate an Archiver.Update task with 
     30 1. [Archiver.Update checks queue (automated as part of celery)] 
     31 2. Open url 
     32  1. If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery 
     33  2. Check headers for content-length and content-type ... 
     34    * IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?) 
     35    * ELSE: check content-type. 
     36      * IF: content-type is NOT a data format (e.g. text/html) then EXIT. (store outcomes and info on resource) 
     37      * ELSE: archive it (compute md5 hash etc) 
     38 3. Archive it: connect to storage system and store it. Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext 
     39  * Add cache URL and updated date to resource 
     40  * Update task_status 
     41  * Add other relevant info to resource such as md5, content-type etc 
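
The header checks in step 2.2 and the storage key in step 3 can be sketched as two small pure functions. This is only an illustration, not the archiver's code: `decide`, `storage_key`, `Decision`, the `MAX_CONTENT_LENGTH` value and the `NON_DATA_TYPES` set are all hypothetical names and values (the ticket only names `max_content_length` and gives text/html as an example of a non-data type).

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed values for illustration; the ticket does not specify them.
MAX_CONTENT_LENGTH = 50_000_000
NON_DATA_TYPES = {"text/html"}


@dataclass
class Decision:
    archive: bool                 # True if the resource should be archived
    reason: str                   # outcome to record on task_status
    size: Optional[int]           # content-length, if the server sent one
    content_type: Optional[str]   # content-type without parameters


def decide(headers: Mapping[str, str]) -> Decision:
    """Apply step 2.2: check content-length first, then content-type."""
    lowered = {k.lower(): v for k, v in headers.items()}
    raw_size = lowered.get("content-length", "")
    size = int(raw_size) if raw_size.isdigit() else None
    # Drop parameters such as "; charset=utf-8".
    ctype = lowered.get("content-type", "").split(";")[0].strip().lower() or None

    if size is not None and size > MAX_CONTENT_LENGTH:
        return Decision(False, "too large", size, ctype)
    if ctype in NON_DATA_TYPES:
        return Decision(False, "not data", size, ctype)
    return Decision(True, "ok", size, ctype)


def storage_key(timestamp: str, resource_id: str, filename: str) -> str:
    """Build the key /{timestamp}/{resourceid}/filename.ext from step 3."""
    return f"/{timestamp}/{resource_id}/{filename}"
```

In both EXIT branches the returned `size` and `content_type` are still available, so the caller could write them back to the resource as the steps suggest.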
     43Link checker: same as Archiver.update up to 2.1
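
A link check that stops at step 2.1 might look roughly like the sketch below. Everything here is an assumption made for illustration: the `link_check` name, the injected `opener` callable (standing in for whatever HTTP client the archiver uses), the plain dict standing in for the task_status table, and the reading of "not more than 3 failures" as "retry while the failure count is at most 3".

```python
from typing import Callable, Dict

MAX_FAILURES = 3  # assumed reading of "not more than 3 failures so far"


def link_check(url: str,
               opener: Callable[[str], int],
               task_status: Dict[str, dict]) -> bool:
    """Open the URL once, record the outcome, return True if reachable.

    `opener` returns an HTTP status code or raises OSError on failure.
    `task_status` is a dict standing in for the task_status table.
    """
    entry = task_status.setdefault(url, {"failures": 0, "state": "pending"})
    try:
        status = opener(url)
        ok = 200 <= status < 400
    except OSError:
        ok = False

    if ok:
        entry["state"] = "ok"
        return True

    entry["failures"] += 1
    # Mark for retry until the failure budget is used up, then give up.
    entry["state"] = "retry" if entry["failures"] <= MAX_FAILURES else "failed"
    return False
```

Passing the opener in keeps the sketch testable without network access; a real task would also report the failure to celery, as step 2.1 notes.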