Ticket #891 (closed task: fixed) — at Version 9

Opened 3 years ago

Last modified 3 years ago

Resource download worker daemon

Reported by: pudo Owned by: johnglover
Priority: critical Milestone: ckan-sprint-2011-11-07
Component: ckan Keywords: queue
Cc: Repository: ckan
Theme: none

Description (last modified by johnglover) (diff)

Superticket: #1397

Write a worker daemon to download all resources from a CKAN instance to a local repository.

Questions

  • Do we only want to download openly licensed information? ANS: no, we download everything (though we do need to think about this with regard to IP issues)
  • Should we have clever ways to dump APIs? ANS: no.
  • Do we respect robots.txt even for openly licensed information? ANS: No (we're not crawling, we're archiving)
  • Use HTTP/1.1 caching headers? ANS: yes; if the resource has not changed since we last cached it, don't bother to re-cache.
    • Complete support for ETags
    • Expires, Max-Age etc.
  • Check
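
The caching-header decision above could be sketched as follows. This is an illustrative Python 3 sketch, not the ticket's implementation: the `stored` dict shape and the helper names are assumptions, and it only covers the ETag / Last-Modified conditional-request part of HTTP/1.1 caching.

```python
def conditional_headers(stored):
    """Build conditional request headers from metadata saved the last
    time we archived this resource (hypothetical dict with optional
    'etag' and 'last_modified' keys)."""
    headers = {}
    if stored.get('etag'):
        headers['If-None-Match'] = stored['etag']
    if stored.get('last_modified'):
        headers['If-Modified-Since'] = stored['last_modified']
    return headers

def needs_recache(status_code):
    """A 304 Not Modified response means the cached copy is still
    fresh, so we skip the download; anything else means re-fetch."""
    return status_code != 304
```

On a subsequent fetch, the archiver would send these headers and only download the body when `needs_recache(response.status)` is true.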

Functionality

  • Download files via HTTP, HTTPS (will not do FTP)

Process:

  1. [Archiver.Update checks queue (automated as part of celery)]
  2. Open the URL and get any available info from the resource (caching headers, content-length, etc.)
    1. If FAILURE status: update the task_status table (retry if there have been no more than 3 failures so far). Report the task failure in celery.
    2. Check headers for content-length and content-type ...
      • IF: content-length > max_content_length: EXIT (store the outcome in task_status, and update the resource with size, content-type and any other info we get?)
      • ELSE: check content-type.
        • IF: not a data format (e.g. text/html): EXIT (store the outcome and info on the resource)
        • ELSE: archive it (compute md5 hash etc.)
      • IF: a content-length was returned and it is unchanged from the stored value: GOTO step 4
  3. Archive it: connect to storage system and store it. Bucket: from config, Key: /archive/{timestamp}/{resourceid}/filename.ext
    • Add cache url to resource and updated date
    • Add other relevant info to resource such as md5, content-type etc
  4. Update task_status
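
The header checks in step 2 and the storage key in step 3 could be sketched like this. A minimal Python 3 sketch under stated assumptions: `MAX_CONTENT_LENGTH`, the `NON_DATA_TYPES` set, and all function names are illustrative, not the actual ckanext-archiver API.

```python
import hashlib
import time

MAX_CONTENT_LENGTH = 50 * 1024 * 1024   # assumed size limit (would come from config)
NON_DATA_TYPES = {'text/html'}           # assumed content-types treated as "not data"

def check_headers(headers):
    """Step 2.2: decide from response headers whether to archive.
    Returns (ok, reason) so the caller can record the outcome in
    task_status."""
    length = headers.get('content-length')
    if length is not None and int(length) > MAX_CONTENT_LENGTH:
        return False, 'too large'
    # Strip any "; charset=..." parameter before comparing the media type
    if headers.get('content-type', '').split(';')[0].strip() in NON_DATA_TYPES:
        return False, 'not a data format'
    return True, 'ok'

def archive_key(resource_id, filename, timestamp=None):
    """Step 3: build the storage key
    /archive/{timestamp}/{resourceid}/filename.ext"""
    ts = timestamp or time.strftime('%Y-%m-%dT%H:%M:%S')
    return '/archive/%s/%s/%s' % (ts, resource_id, filename)

def md5_hash(data):
    """Hash the downloaded bytes for storage alongside the resource."""
    return hashlib.md5(data).hexdigest()
```

A worker would call `check_headers` after opening the URL, and on success store the body under `archive_key(...)` in the configured bucket, recording the md5 and content-type on the resource.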

Optional functionality

  • If the result object is HTML, search for references to "proper data" (CSV download pages etc.)
  • Download from POST forms (accepting licenses or weird proprietary systems)
  • Support running on Google App Engine to save traffic costs.
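
The first optional item, scanning an HTML page for links to "proper data", could be sketched with the standard-library HTML parser. This is a hedged sketch: the `DATA_EXTENSIONS` tuple and the class/function names are assumptions, and a real implementation would likely want content-type sniffing rather than just suffix matching.

```python
from html.parser import HTMLParser

# Assumed file suffixes that indicate a direct data download
DATA_EXTENSIONS = ('.csv', '.xls', '.json')

class DataLinkFinder(HTMLParser):
    """Collect href values that look like direct data downloads."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.lower().endswith(DATA_EXTENSIONS):
                    self.links.append(value)

def find_data_links(html):
    """Return candidate data-download URLs found in an HTML page."""
    finder = DataLinkFinder()
    finder.feed(html)
    return finder.links
```

The archiver could feed any `text/html` response body through `find_data_links` and queue the returned URLs as follow-up download candidates.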

Existing work

Change History

comment:1 Changed 3 years ago by pudo

  • Description modified (diff)

comment:2 Changed 3 years ago by pudo

  • Milestone changed from ckan-v1.3 to iati-4

comment:3 Changed 3 years ago by rgrp

  • Status changed from new to assigned
  • Description modified (diff)
  • Repository set to ckan
  • Theme set to none
  • Milestone set to ckan-v1.6
  • Owner set to kindly

comment:4 Changed 3 years ago by rgrp

  • Description modified (diff)

comment:5 Changed 3 years ago by rgrp

  • Milestone changed from ckan-v1.6 to ckan-sprint-2011-10-24

May only do link-checker and not do full storage in this sprint.

comment:6 Changed 3 years ago by rgrp

  • Description modified (diff)

comment:7 Changed 3 years ago by johnglover

  • Keywords queue added
  • Owner changed from kindly to johnglover
  • Description modified (diff)

comment:8 Changed 3 years ago by johnglover

  • Milestone changed from ckan-sprint-2011-10-24 to current-ckan-sprint-2011-11-07

Almost finished (see http://github.com/okfn/ckanext-archiver).

Still to address:

  • check headers to see if hash / cache / max-age / expires indicates that the resource does not need to be downloaded.
  • add cache url to resource

comment:9 Changed 3 years ago by johnglover

  • Status changed from assigned to closed
  • Resolution set to fixed
  • Description modified (diff)

Added cache_url and cache_last_updated to resources after archiving.

We are not checking for a hash value in the headers. This process will generally only run when a new resource is added or someone updates a URL, so we don't expect to be regularly downloading the same resource.

We will need something along these lines if this runs as a regular cron job, but in that case the logic will be added to the cron job itself (probably a paster command).
