Ticket #891 (closed task: fixed) — at Version 9
Resource download worker daemon
| Reported by: | pudo | Owned by: | johnglover |
|---|---|---|---|
| Priority: | critical | Milestone: | ckan-sprint-2011-11-07 |
| Component: | ckan | Keywords: | queue |
| Cc: | | Repository: | ckan |
| Theme: | none | | |
Description (last modified by johnglover)
Superticket: #1397
Write a worker daemon to download all resources from a CKAN instance to a local repository.
Questions
- Do we only want to download openly licensed information? ANS: no, we download everything (though we do need to think about this with regard to IP issues)
- Should we have clever ways to dump APIs? ANS: no.
- Do we respect robots.txt even for openly licensed information? ANS: no (we're not crawling, we're archiving)
- Use HTTP/1.1 caching headers? ANS: yes; if the resource has not changed since we last fetched it, don't bother to recache.
- Complete support for ETags
- Expires, Max-Age etc.
- Check
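The HTTP/1.1 caching answer above amounts to a conditional GET: resend the ETag and Last-Modified values stored from the previous fetch, and treat a 304 Not Modified response as "skip recaching". A minimal sketch; the function names and the shape of the saved-metadata dict are illustrative, not part of the actual ckanext-archiver code:

```python
def conditional_headers(saved):
    """Build request headers for a conditional GET from metadata
    saved after the previous successful download (assumed keys:
    'etag' and 'last_modified')."""
    headers = {}
    if saved.get('etag'):
        headers['If-None-Match'] = saved['etag']
    if saved.get('last_modified'):
        headers['If-Modified-Since'] = saved['last_modified']
    return headers

def needs_refetch(status_code):
    """A 304 Not Modified response means the cached copy is current."""
    return status_code != 304
```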
Functionality
- Download files via HTTP, HTTPS (will not do FTP)
Process:
- [Archiver.Update checks queue (automated as part of celery)]
- Open url and get any info from resource on cache / content-length etc
- If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery
- Check headers for content-length and content-type ...
- IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?)
- ELSE: check content-type.
- IF: not a data format (e.g. text/html) then EXIT. (store outcomes and info on resource)
- ELSE: archive it (compute md5 hash etc)
- IF: we got a content-length and it is unchanged since the last fetch, GOTO step 4
- Archive it: connect to storage system and store it. Bucket: from config, Key: /archive/{timestamp}/{resourceid}/filename.ext
- Add cache url and updated date to the resource
- Add other relevant info to resource such as md5, content-type etc
- Update task_status
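The header-checking steps above can be sketched as a pure decision function plus the storage-key layout from step 4. Everything here is a placeholder sketch, not the ckanext-archiver API: the size limit, the set of accepted content types, and all function names are assumptions.

```python
import hashlib

MAX_CONTENT_LENGTH = 50 * 1024 * 1024        # would come from config
DATA_FORMATS = {'text/csv', 'application/json', 'application/xml'}  # assumed

def check_headers(headers, last_length=None):
    """Decide what to do with a resource from its response headers.

    Returns 'too_large', 'not_data', 'unchanged' or 'archive',
    mirroring the EXIT / GOTO branches in the process above.
    """
    length = headers.get('content-length')
    if length is not None and int(length) > MAX_CONTENT_LENGTH:
        return 'too_large'       # EXIT: record outcome in task_status
    content_type = headers.get('content-type', '').split(';')[0].strip()
    if content_type and content_type not in DATA_FORMATS:
        return 'not_data'        # e.g. text/html: EXIT, store info on resource
    if length is not None and length == last_length:
        return 'unchanged'       # skip re-downloading
    return 'archive'

def archive_key(timestamp, resource_id, filename):
    """Storage key layout from step 4 of the process."""
    return '/archive/%s/%s/%s' % (timestamp, resource_id, filename)

def md5_of(content):
    """Hash computed while archiving, stored back on the resource."""
    return hashlib.md5(content).hexdigest()
```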
Optional functionality
- If result object is HTML, search for references to "proper data" (CSV download pages etc.)
- Download from POST forms (accepting licenses or weird proprietary systems)
- Support running on Google Apps Engine to save traffic costs.
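The first optional item (scanning an HTML result for links to "proper data") could be sketched with the stdlib HTML parser; the list of data-file extensions is an assumption, as is the class name:

```python
from html.parser import HTMLParser

DATA_EXTENSIONS = ('.csv', '.xls', '.json', '.xml')  # assumed set

class DataLinkFinder(HTMLParser):
    """Collect href values that look like direct data downloads."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.lower().endswith(DATA_EXTENSIONS):
                self.links.append(href)

finder = DataLinkFinder()
finder.feed('<a href="/files/report.csv">CSV</a> <a href="/about">About</a>')
```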
Existing work
- https://bitbucket.org/okfn/ckanext-qa/overview
- out of date: https://bitbucket.org/pudo/ckanextarchive - Old archiver extension, largely experimental.
- out of date: https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65 - Openness scores by ollyc
Change History
comment:3 Changed 3 years ago by rgrp
- Status changed from new to assigned
- Description modified
- Repository set to ckan
- Theme set to none
- Milestone set to ckan-v1.6
- Owner set to kindly
comment:5 Changed 3 years ago by rgrp
- Milestone changed from ckan-v1.6 to ckan-sprint-2011-10-24
May only do link-checker and not do full storage in this sprint.
comment:7 Changed 3 years ago by johnglover
- Keywords queue added
- Owner changed from kindly to johnglover
- Description modified
comment:8 Changed 3 years ago by johnglover
- Milestone changed from ckan-sprint-2011-10-24 to current-ckan-sprint-2011-11-07
Almost finished (see http://github.com/okfn/ckanext-archiver).
Still to address:
- check headers to see if hash / cache / max-age / expires indicates that the resource does not need to be downloaded.
- add cache url to resource
comment:9 Changed 3 years ago by johnglover
- Status changed from assigned to closed
- Resolution set to fixed
- Description modified
Added cache_url and cache_last_updated to resources after archiving.
Not checking for hash value in headers. This process will generally only run when a new resource is added or someone updates a URL, so we don't expect to be regularly downloading the same resource.
We will need something along these lines if this is running as a regular cron job, but in that case the logic will be added to the cron job itself (probably a paster command).