Ticket #1397 (new enhancement) — at Version 3

Opened 3 years ago

Last modified 2 years ago

[super] Resource archiving

Reported by: rgrp Owned by: kindly
Priority: major Milestone: ckan-v1.7
Component: ckan Keywords:
Cc: Repository: ckan
Theme: none

Description (last modified by rgrp) (diff)

We want to cache/archive the data associated with a resource so that it remains available if the resource URL disappears (and to support other processing we may wish to do, e.g. webstorer ...)

Etherpad: http://ckan.okfnpad.org/queue (most relevant parts inlined here)

Preliminaries

  • Add a task_status table to store qa/archiver/webstore information that does not need to be versioned. - #1363 (and #1371 - related logic functions)

Tasks

  1. Resource change notifications in core - Add IResourceChange and IResourceUrlChange interfaces. [1d] [0.75d] - #1383
  2. ckanext-archiver implements IResourceUrlChange and sends tasks to celery. [0.25d][0.25d] - ???
  3. Archiver daemon #891
    1. Implement link-check function and task (point 2 from Archiver.update below) [1d] [0.5d]
    2. Rewrite archiver to use external storage (decide how!) [3d] [~2d]
  4. Write to resource and task_status table. [1d] [0.75d]
  5. Make archived data available in WUI - #892

Archiver process

Archiver:

  1. A resource is added to CKAN
  2. An IResourceCreate event is generated
  3. IF the resource URL points to CKAN storage or meets some other exclusion condition, END; otherwise continue
  4. Generate an Archiver.update task with resource.id
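
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: EXCLUDED_HOSTS, should_archive, on_resource_created, the shape of the task dict and the http/https restriction are all hypothetical, and enqueue stands in for whatever actually hands the task to celery.

```python
from urllib.parse import urlparse

# Hypothetical exclusion list: hosts that already serve CKAN-managed storage.
EXCLUDED_HOSTS = {"storage.ckan.example"}


def should_archive(url, excluded_hosts=EXCLUDED_HOSTS):
    """Step 3: return True when the URL is outside the exclusion conditions."""
    parsed = urlparse(url)
    # Assumption: only http(s) URLs are worth archiving.
    if parsed.scheme not in ("http", "https"):
        return False
    return parsed.hostname not in excluded_hosts


def on_resource_created(resource, enqueue):
    """Steps 2 to 4: react to a new resource and queue an update task."""
    if not should_archive(resource["url"]):
        return None  # END: nothing to archive
    # Assumed task shape; the ticket only says the task carries resource.id.
    task = {"task": "archiver.update", "resource_id": resource["id"]}
    enqueue(task)
    return task
```

Passing enqueue in as a callable keeps the decision logic testable without a running queue.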

Archiver.update

  1. [Archiver.update checks queue (automated as part of celery)]
  2. Open url
    1. If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery
    2. Check headers for content-length and content-type ...
      • IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?)
      • ELSE: check content-type.
        • IF: NOT data (e.g. text/html) then EXIT. (store outcomes and info on resource)
        • ELSE: archive it (compute md5 hash etc)
  3. Archive it: connect to storage system and store it. Bucket: from config, Key: /{timestamp}/{resourceid}/filename.ext
    • Add cache URL and updated date to the resource
    • Update task_status
    • Add other relevant info to resource such as md5, content-type etc
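
A minimal sketch of the header checks in step 2.2 and the key layout in step 3, for illustration only. The limit value, the DATA_CONTENT_TYPES set and all function names are assumptions, as is the choice to continue when content-length is missing; the ticket does not specify any of them. Header names are assumed to be lower-cased already.

```python
import hashlib

MAX_CONTENT_LENGTH = 50000000  # assumed limit, in bytes
# Assumed whitelist of "data" content types.
DATA_CONTENT_TYPES = {"text/csv", "application/json", "application/vnd.ms-excel"}


def check_headers(headers, max_content_length=MAX_CONTENT_LENGTH):
    """Return (decision, info) where decision is too_large, not_data or archive."""
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    length = headers.get("content-length")
    size = int(length) if length and length.isdigit() else None
    # info is what would be stored on the resource whatever the outcome.
    info = {"size": size, "content_type": content_type}
    if size is not None and size > max_content_length:
        return "too_large", info
    if content_type not in DATA_CONTENT_TYPES:
        return "not_data", info
    return "archive", info


def storage_key(timestamp, resource_id, filename):
    """Build the key /{timestamp}/{resourceid}/filename.ext from step 3."""
    return "/{0}/{1}/{2}".format(timestamp, resource_id, filename)


def md5_of(chunks):
    """Compute the md5 hex digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

Hashing chunk by chunk means the hash can be computed while the download is being streamed to storage.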

Link checker: same as Archiver.update up to 2.1
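
The link checker and the retry rule from step 2.1 might look something like this. It is a sketch, not the archiver's actual code: link_check, record_failure, the task_status dict fields and MAX_FAILURES are hypothetical names, open_url is an injected callable that returns response headers or raises OSError, and counting the current failure towards the limit of 3 is an assumption.

```python
MAX_FAILURES = 3  # "not more than 3 failures so far"


def record_failure(task_status, error):
    """Return an updated copy of task_status after a failed attempt."""
    status = dict(task_status)
    failures = status.get("failures", 0) + 1
    status.update(
        state="failed",
        failures=failures,
        error=error,
        # Assumption: the failure just recorded counts towards the limit.
        retry=failures <= MAX_FAILURES,
    )
    return status


def link_check(url, task_status, open_url):
    """Open the URL; return (headers, new_task_status)."""
    try:
        headers = open_url(url)
    except OSError as exc:
        # Step 2.1: record the failure; the caller reports it to the queue.
        return None, record_failure(task_status, str(exc))
    status = dict(task_status)
    status.update(state="ok", failures=0, error=None, retry=False)
    return headers, status
```

Archiver.update could call the same function and carry on to the header checks when headers is not None.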

Change History

comment:1 Changed 3 years ago by rgrp

  • Description modified (diff)

comment:2 Changed 3 years ago by rgrp

  • Description modified (diff)

comment:3 Changed 3 years ago by rgrp

  • Description modified (diff)