Ticket #891 (new task) — at Version 1
Resource download worker daemon
Reported by: | pudo | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | ckan-sprint-2011-11-07 |
Component: | ckan | Keywords: | queue |
Cc: | | Repository: | ckan |
Theme: | none | | |
Description (last modified by pudo)
Write a worker daemon to download all resources from a CKAN instance to an OFS repository.
Open questions
- Do we only want to download openly licensed information?
- Should we have clever ways to dump APIs?
- Do we respect robots.txt even for openly licensed information?
Functionality
- Download files via HTTP, HTTPS and, optionally, FTP.
- Respect HTTP/1.1 caching headers:
  - Complete support for ETags
  - Expires, Cache-Control max-age, etc.
- Handle errors and classify them as temporary or permanent
- Respect robots.txt
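
The caching, error-handling and robots.txt points above could be sketched roughly as follows. This is only an illustration using the Python standard library, not an agreed design: the function names, the `USER_AGENT` string and the set of status codes treated as temporary are all assumptions made for the example, and the actual network fetch is left out.

```python
import email.utils
import re
import urllib.robotparser

# Hypothetical user agent string; the ticket does not name one.
USER_AGENT = "ckan-archiver-sketch"

# Assumed classification: failures with these statuses are retried later.
TEMPORARY_STATUSES = {408, 429, 500, 502, 503, 504}


def classify_status(status):
    """Classify a final HTTP status code (redirects already followed)."""
    if 200 <= status < 300:
        return "ok"
    if status == 304:
        return "not_modified"
    if status in TEMPORARY_STATUSES:
        return "temporary"
    return "permanent"


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional GET from stored validators."""
    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def freshness_lifetime(headers, fetched_at):
    """Seconds a stored response may be reused without revalidation.

    headers: dict with lower-cased header names.
    fetched_at: Unix timestamp of the original download.
    """
    cache_control = headers.get("cache-control", "")
    if "no-store" in cache_control or "no-cache" in cache_control:
        return 0
    match = re.search(r"max-age=(\d+)", cache_control)
    if match:
        # max-age takes precedence over Expires.
        return int(match.group(1))
    expires = headers.get("expires")
    if expires:
        try:
            expires_at = email.utils.parsedate_to_datetime(expires).timestamp()
        except (TypeError, ValueError):
            # Unparseable values such as "0" mean "already expired".
            return 0
        return max(0, int(expires_at - fetched_at))
    return 0


def allowed_by_robots(robots_txt, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)
```

A worker would store the `ETag` and `Last-Modified` values of each download, send them back via `conditional_headers` on the next run, and requeue only those resources whose failure `classify_status` reports as `temporary`.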
Optional functionality
- If the downloaded result is HTML, search it for references to "proper data" (CSV download pages etc.)
- Download from POST forms (accepting licenses or weird proprietary systems)
- Support running on Google App Engine to save traffic costs.
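
As a rough idea of what the HTML case in the first optional item might look like, the sketch below collects links from a page whose path ends in a data-like file extension. The `DataLinkFinder` class, the `find_data_links` helper and the extension list are hypothetical choices for illustration; the ticket itself only mentions CSV.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of "proper data" extensions; only CSV is named in the ticket.
DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".json", ".xml", ".zip")


class DataLinkFinder(HTMLParser):
    """Collect absolute URLs of <a href> links that look like data files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(DATA_EXTENSIONS):
            self.links.append(absolute)


def find_data_links(html, base_url):
    """Return candidate data URLs found in an HTML document."""
    finder = DataLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Matching on file extensions is a deliberately simple heuristic; it will miss download pages that serve data from extension-less URLs.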
Existing work
- https://bitbucket.org/pudo/ckanextarchive - Old archiver extension, largely experimental.
- https://bitbucket.org/ollyc/ckan/changeset/1b16fbe9aa65 - Openness scores by ollyc