Ticket #1397 (new enhancement) — at Version 3
[super] Resource archiving
Reported by: | rgrp | Owned by: | kindly |
---|---|---|---|
Priority: | major | Milestone: | ckan-v1.7 |
Component: | ckan | Keywords: | |
Cc: | | Repository: | ckan |
Theme: | none |
Description (last modified by rgrp)
We want to cache/archive the data associated with a resource so that it remains available if the resource url disappears (and in order to support other processing we may wish to do, e.g. webstorer ...)
Etherpad: http://ckan.okfnpad.org/queue (most relevant parts inlined here)
Preliminaries
- Add task_status table to store qa/archiver/webstore information that does not need to be versioned. - #1363 (and #1371 - related logic functions)
Tasks
- Resource change notifications in core - Make an IResourceChange and IResourceUrlChange. [1d] [0.75d] - #1383
- ckanext-archiver implements IResourceUrlChange and sends tasks to celery. [0.25d][0.25d] - ???
- Archiver daemon #891
- implement link-check function and task (point 2 from Archiver.update above) [1d] [0.5d]
- Rewrite archiver to use external storage. (decide how!)[3d][~2d]
- Write to resource and task status table.[1d][0.75d]
- Make archived data available in WUI - #892
Archiver process
Archiver:
- A resource is added to CKAN
- IResourceCreate event generated
- IF: resource url points to ckan storage or meets some other exclusion condition, then END; else continue
- Generate an Archiver.update task with resource.id
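The enqueue step above could look roughly like the sketch below. It is illustrative only: `STORAGE_HOSTS`, `should_archive`, `on_resource_created` and the `enqueue` callback are hypothetical names, and the only exclusion conditions shown (non-HTTP urls and urls on a storage host) are assumptions, since the ticket does not list the actual conditions.

```python
from urllib.parse import urlparse

# Assumption: hosts that serve CKAN's own storage. The real list would
# come from configuration; this value is a placeholder.
STORAGE_HOSTS = {"storage.example.org"}


def should_archive(resource_url, storage_hosts=STORAGE_HOSTS):
    """Return False when the url falls under an exclusion condition."""
    parsed = urlparse(resource_url)
    # Assumed exclusion: only http(s) urls can be fetched by the archiver.
    if parsed.scheme not in ("http", "https"):
        return False
    # Exclusion from the ticket: the url already points to ckan storage.
    if parsed.hostname in storage_hosts:
        return False
    return True


def on_resource_created(resource, enqueue):
    """Queue an update task for the resource unless it is excluded.

    `enqueue` stands in for whatever sends the task to celery.
    """
    if should_archive(resource["url"]):
        enqueue("archiver.update", resource["id"])
        return True
    return False
```

Passing the queue call in as `enqueue` keeps the decision logic testable without a running celery broker.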
Archiver.update
- [Archiver.Update checks queue (automated as part of celery)]
- Open url
- If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery
- Check headers for content-length and content-type ...
- IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?)
- ELSE: check content-type.
- IF: content-type is not a data format (e.g. it is text/html) then EXIT. (store outcomes and info on resource)
- ELSE: archive it (compute md5 hash etc)
- Archive it: connect to storage system and store it. Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext
- Add cache url to resource and updated date
- Update task_status
- Add other relevant info to resource such as md5, content-type etc
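The header checks, storage key and hash from the steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `MAX_CONTENT_LENGTH` and `DATA_CONTENT_TYPES` are placeholder values, the function names are hypothetical, and `headers` is assumed to be a plain dict with lower-case keys. Fetching the url and writing to the storage system are left out.

```python
import hashlib
import posixpath
from urllib.parse import urlparse

# Assumptions: the ticket names max_content_length but gives no value,
# and does not list which content types count as data.
MAX_CONTENT_LENGTH = 50_000_000
DATA_CONTENT_TYPES = {"text/csv", "application/json", "application/vnd.ms-excel"}


def decide(headers, max_content_length=MAX_CONTENT_LENGTH):
    """Return ("archive", None) or ("skip", reason) from response headers."""
    # A missing content-length is treated as 0 here (an assumption).
    length = int(headers.get("content-length", 0))
    # Drop parameters such as "; charset=utf-8" before comparing.
    ctype = headers.get("content-type", "").split(";")[0].strip().lower()
    if length > max_content_length:
        return ("skip", "too large")
    if ctype not in DATA_CONTENT_TYPES:
        return ("skip", "not data")
    return ("archive", None)


def storage_key(timestamp, resource_id, url):
    """Build the key /{timestamp}/{resourceid}/filename.ext."""
    # Fall back to a fixed name when the url path has no filename.
    filename = posixpath.basename(urlparse(url).path) or "resource"
    return "/{}/{}/{}".format(timestamp, resource_id, filename)


def md5_of(chunks):
    """Compute the md5 hex digest over an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

Hashing chunk by chunk lets the same loop that streams the download to storage also produce the md5 without holding the whole file in memory.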
Link checker: same as Archiver.update up to 2.1
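The link checker and the archiver share the failure handling from the "Open url" step. A possible shape for it is sketched below, with a plain dict standing in for a task_status row. The field names (`failures`, `last_error`, `state`) and function names are hypothetical, and the sketch assumes the failure count includes the current failure, which the ticket leaves open.

```python
# From the ticket: "could retry if not more than 3 failures so far".
MAX_FAILURES = 3


def next_step(failures_so_far, max_failures=MAX_FAILURES):
    """Decide whether a failed task should be retried or abandoned."""
    return "retry" if failures_so_far <= max_failures else "give_up"


def record_failure(task_status, error, max_failures=MAX_FAILURES):
    """Return an updated copy of the task_status row after a failure."""
    updated = dict(task_status)
    # Count this failure before deciding (an assumption, see above).
    updated["failures"] = task_status.get("failures", 0) + 1
    updated["last_error"] = error
    updated["state"] = next_step(updated["failures"], max_failures)
    return updated
```

For example, a resource whose url has already failed three times gets `state` set to `"give_up"` on the next failure, at which point the task failure would be reported in celery.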