Ticket #891 (closed task: fixed) — at Version 9
Resource download worker daemon
| Reported by: | pudo | Owned by: | johnglover |
|---|---|---|---|
| Priority: | critical | Milestone: | ckan-sprint-2011-11-07 |
| Component: | ckan | Keywords: | queue |
| Cc: | | Repository: | ckan |
| Theme: | none | | |
Description (last modified by johnglover)
Superticket: #1397
Write a worker daemon to download all resources from a CKAN instance to a local repository.
Questions
- Do we only want to download openly licensed information? ANS: no, we download everything (though we do need to think about this with regard to IP issues)
- Should we have clever ways to dump APIs? ANS: no.
- Do we respect robots.txt even for openly licensed information? ANS: no (we're not crawling, we're archiving)
- Use HTTP/1.1 caching headers? ANS: yes; if the resource has not changed since we last fetched it, don't bother to recache.
- Complete support for ETags
- Expires, Max-Age etc.
- Check
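The HTTP/1.1 caching answer above amounts to a conditional GET: resend the ETag and Last-Modified values stored from the previous fetch, and treat a 304 Not Modified response as "skip recaching". A minimal sketch; the function names and the shape of the saved-metadata dict are illustrative, not part of the actual ckanext-archiver code:

```python
def conditional_headers(saved):
    """Build request headers for a conditional GET from metadata
    saved after the previous successful download (assumed keys:
    'etag' and 'last_modified')."""
    headers = {}
    if saved.get('etag'):
        headers['If-None-Match'] = saved['etag']
    if saved.get('last_modified'):
        headers['If-Modified-Since'] = saved['last_modified']
    return headers

def needs_refetch(status_code):
    """A 304 Not Modified response means the cached copy is current."""
    return status_code != 304
```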
Functionality
- Download files via HTTP, HTTPS (will not do FTP)
Process:
- [Archiver.Update checks queue (automated as part of celery)]
- Open url and get any info from resource on cache / content-length etc
- If FAILURE status: update task_status table (could retry if not more than 3 failures so far). Report task failure in celery
- Check headers for content-length and content-type ...
- IF: content-length > max_content_length: EXIT (store outcomes on task_status, and update resource with size and content-type and any other info we get?)
- ELSE: check content-type.
- IF: not a data format (e.g. text/html) then EXIT. (store outcomes and info on resource)
- ELSE: archive it (compute md5 hash etc)
- IF: we got a content-length and it is unchanged since the last fetch, GOTO step 4
- Archive it: connect to storage system and store it. Bucket: from config, Key: /archive/{timestamp}/{resourceid}/filename.ext
- Add cache url and updated date to the resource
- Add other relevant info to resource such as md5, content-type etc
- Update task_status
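The header-checking steps above can be sketched as a pure decision function plus the storage-key layout from step 4. Everything here is a placeholder sketch, not the ckanext-archiver API: the size limit, the set of accepted content types, and all function names are assumptions.

```python
import hashlib

MAX_CONTENT_LENGTH = 50 * 1024 * 1024        # would come from config
DATA_FORMATS = {'text/csv', 'application/json', 'application/xml'}  # assumed

def check_headers(headers, last_length=None):
    """Decide what to do with a resource from its response headers.

    Returns 'too_large', 'not_data', 'unchanged' or 'archive',
    mirroring the EXIT / GOTO branches in the process above.
    """
    length = headers.get('content-length')
    if length is not None and int(length) > MAX_CONTENT_LENGTH:
        return 'too_large'       # EXIT: record outcome in task_status
    content_type = headers.get('content-type', '').split(';')[0].strip()
    if content_type and content_type not in DATA_FORMATS:
        return 'not_data'        # e.g. text/html: EXIT, store info on resource
    if length is not None and length == last_length:
        return 'unchanged'       # skip re-downloading
    return 'archive'

def archive_key(timestamp, resource_id, filename):
    """Storage key layout from step 4 of the process."""
    return '/archive/%s/%s/%s' % (timestamp, resource_id, filename)

def md5_of(content):
    """Hash computed while archiving, stored back on the resource."""
    return hashlib.md5(content).hexdigest()
```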
Optional functionality
- If result object is HTML, search for references to "proper data" (CSV download pages etc.)
- Download from POST forms (accepting licenses or weird proprietary systems)
- Support running on Google Apps Engine to save traffic costs.
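The first optional item (scanning an HTML result for links to "proper data") could be sketched with the stdlib HTML parser; the list of data-file extensions is an assumption, as is the class name:

```python
from html.parser import HTMLParser

DATA_EXTENSIONS = ('.csv', '.xls', '.json', '.xml')  # assumed set

class DataLinkFinder(HTMLParser):
    """Collect href values that look like direct data downloads."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.lower().endswith(DATA_EXTENSIONS):
                self.links.append(href)

finder = DataLinkFinder()
finder.feed('<a href="/files/report.csv">CSV</a> <a href="/about">About</a>')
```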
Existing work
- https://bitbucket.org/okfn/ckanext-qa/overview
- out of date: https://bitbucket.org/pudo/ckanextarchive - Old archiver extension, largely experimental.
- out of date: https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65 - Openness scores by ollyc
Change History
comment:3 Changed 3 years ago by rgrp
- Status changed from new to assigned
- Description modified
- Repository set to ckan
- Theme set to none
- Milestone set to ckan-v1.6
- Owner set to kindly
comment:5 Changed 3 years ago by rgrp
- Milestone changed from ckan-v1.6 to ckan-sprint-2011-10-24
May only do link-checker and not do full storage in this sprint.
comment:7 Changed 3 years ago by johnglover
- Keywords queue added
- Owner changed from kindly to johnglover
- Description modified
comment:8 Changed 3 years ago by johnglover
- Milestone changed from ckan-sprint-2011-10-24 to current-ckan-sprint-2011-11-07
Almost finished (see http://github.com/okfn/ckanext-archiver).
Still to address:
- check headers to see if hash / cache / max-age / expires indicates that the resource does not need to be downloaded.
- add cache url to resource
comment:9 Changed 3 years ago by johnglover
- Status changed from assigned to closed
- Resolution set to fixed
- Description modified
Added cache_url and cache_last_updated to resources after archiving.
Not checking for hash value in headers. This process will generally only run when a new resource is added or someone updates a URL, so we don't expect to be regularly downloading the same resource.
We will need something along these lines if this is running as a regular cron job, but in that case the logic will be added to the cron job itself (probably a paster command).