5 | | * Resource change notifications in core - #1383 |
6 | | * Resource download worker daemon - #891 |
7 | | * Make archived data available in WUI - #892 |
8 | | * Introduce timed actions into ckanext-queue - #890 |
| 5 | == Preliminaries == |
| 6 | |
| 7 | * Add task_status table to store qa/archiver/webstore information that does not need to be versioned. - #1363 (and #1371 - related logic functions) |
| 8 | |
| 9 | == Tasks == |
| 10 | |
| 11 | 1. Resource change notifications in core - Make an IResourceChange and IResourceUrlChange. [1d] [0.75d] - #1383 |
| 12 | 2. ckanext-archiver implements IResourceUrlChange and sends tasks to celery. [0.25d][0.25d] - ??? |
| 13 | 3. Archiver daemon #891 |
| 14 | 1. Implement link-check function and task (point 2 from Archiver.update below) [1d] [0.5d] |
| 15 | 2. Rewrite archiver to use external storage (decide how!). [3d] [~2d] |
| 16 | 5. Write to resource and task status table. [1d] [0.75d] |
| 17 | 6. Make archived data available in WUI - #892 |
| 18 | |
| 19 | == Archiver process == |
| 20 | |
| 21 | Archiver: |
| 22 | |
| 23 | 0. A resource is added to CKAN |
| 24 | 1. IResourceCreate event generated |
| 25 | 2. IF: the resource URL points to CKAN storage or matches any other exclusion condition, then END; else continue |
| 26 | 3. Generate an Archiver.update task with resource.id |
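
The steps above can be sketched in Python as a small filter plus a dispatch function. This is only an illustration under stated assumptions: `CKAN_STORAGE_PREFIXES`, `is_excluded`, `on_resource_created` and the `enqueue` callback are hypothetical names, the non-HTTP exclusion is an assumed example of "other exclusion conditions", and a real plugin would presumably hand the task to celery rather than to a plain callback.

```python
from urllib.parse import urlparse

# Hypothetical prefix for files already held in CKAN's own storage.
CKAN_STORAGE_PREFIXES = ("http://ckan.example.org/storage/",)


def is_excluded(url, storage_prefixes=CKAN_STORAGE_PREFIXES):
    """Step 2: return True when the resource should not be archived."""
    if url.startswith(tuple(storage_prefixes)):
        return True
    # Assumed extra exclusion condition: only http(s) URLs can be fetched.
    return urlparse(url).scheme not in ("http", "https")


def on_resource_created(resource, enqueue):
    """Steps 1-3: handle a newly added resource.

    `enqueue` stands in for the task queue; it receives a task name and
    the resource id. Returns True if a task was queued.
    """
    if is_excluded(resource["url"]):
        return False  # END: nothing to archive
    enqueue("archiver.update", resource["id"])
    return True
```

Passing the queue in as a callback keeps the exclusion logic testable without a running broker.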
| 27 | |
| 28 | Archiver.update |
| 29 | |
| 30 | 1. [Archiver.update checks queue (automated as part of celery)] |
| 31 | 2. Open url |
| 32 | 1. If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery |
| 33 | 2. Check headers for content-length and content-type ... |
| 34 | * IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?) |
| 35 | * ELSE: check content-type. |
| 36 | * IF: the content-type is not a data format (e.g. text/html), then EXIT. (store outcomes and info on resource) |
| 37 | * ELSE: archive it (compute md5 hash etc) |
| 38 | 3. Archive it: connect to storage system and store it. Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext |
| 39 | * Add the cache URL and the updated date to the resource |
| 40 | * Update task_status |
| 41 | * Add other relevant info to resource such as md5, content-type etc |
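
A minimal sketch of the header check (step 2.2) and of the storage key from step 3, written as pure functions so they can be tried without a network or a storage backend. The size limit, the set of "data" content types, the timestamp format and the function names are all assumptions made for illustration; the plan itself only names `max_content_length` and the key layout `/{timestamp}/{resourceid}/filename.ext`. Header names are assumed to be lower-cased already.

```python
import hashlib
from datetime import datetime

MAX_CONTENT_LENGTH = 50000000  # assumed value; would come from config
# Assumed whitelist of content types that count as data.
DATA_CONTENT_TYPES = {"text/csv", "application/json", "application/vnd.ms-excel"}


def decide(headers, max_content_length=MAX_CONTENT_LENGTH):
    """Return (action, info); action is 'archive' or 'exit'."""
    length = int(headers.get("content-length", 0))
    # Drop parameters such as "; charset=utf-8" before comparing.
    ctype = headers.get("content-type", "").split(";")[0].strip().lower()
    info = {"size": length, "content_type": ctype}
    if length > max_content_length:
        return "exit", dict(info, reason="too large")
    if ctype not in DATA_CONTENT_TYPES:
        return "exit", dict(info, reason="not a data content type")
    return "archive", info


def storage_key(resource_id, filename, when):
    """Build /{timestamp}/{resourceid}/filename.ext (timestamp format assumed)."""
    return "/{0}/{1}/{2}".format(when.strftime("%Y%m%dT%H%M%S"), resource_id, filename)


def md5_of(content):
    """Hex md5 digest of the downloaded bytes, to be stored on the resource."""
    return hashlib.md5(content).hexdigest()
```

In both exit cases the returned `info` still carries the size and content type, matching the note that these should be written back to the resource even when nothing is archived.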
| 42 | |
| 43 | Link checker: same as Archiver.update up to 2.1 |
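
The link checker could be sketched as below: it opens the URL, and on failure it produces a record for task_status and says whether a retry is allowed. The record's field names, the `open_url` callback (assumed to return an HTTP status code or raise `OSError`) and the treatment of status codes of 400 and above as failures are assumptions, not part of the plan; only the limit of 3 failures comes from step 2.1.

```python
MAX_FAILURES = 3  # from step 2.1: retry while not more than 3 failures so far


def check_link(resource_id, url, open_url, previous_failures=0):
    """Open the URL and return a dict shaped like a hypothetical task_status row."""
    try:
        status = open_url(url)
        # Assumption: any 4xx/5xx response counts as a failed link.
        error = None if status < 400 else "HTTP {0}".format(status)
    except OSError as exc:
        error = str(exc)

    if error is None:
        return {"entity_id": resource_id, "state": "ok", "failures": 0, "retry": False}

    failures = previous_failures + 1
    return {
        "entity_id": resource_id,
        "state": "failed",
        "error": error,
        "failures": failures,
        "retry": failures <= MAX_FAILURES,
    }
```

The caller would write the returned record to task_status and, when `retry` is false, report the task as failed in celery.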