Ticket #891 (assigned task) — at Version 3

Opened 3 years ago

Last modified 3 years ago

Resource download worker daemon

Reported by: pudo
Owned by: kindly
Priority: critical
Milestone: ckan-sprint-2011-11-07
Component: ckan
Keywords: queue
Cc:
Repository: ckan
Theme: none

Description (last modified by rgrp) (diff)

Superticket: #1397

Write a worker daemon to download all resources from a CKAN instance to an OFS repository.

Open questions

  • Do we only want to download openly licensed information?
  • Should we have clever ways to dump APIs?
  • Do we respect robots.txt even for openly licensed information?

Functionality

  • Download files via HTTP, HTTPS and, optionally, FTP.
  • Respect HTTP/1.1 caching headers:
    • Complete support for ETags
    • Expires, Cache-Control max-age, etc.
  • Handle errors and classify them as temporary or permanent
  • Respect robots.txt
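
The core requirements above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the design of the actual daemon: the function names, the user agent string "ckan-downloader" used in the usage note, and the set of status codes treated as temporary are all assumptions made for the example. It also assumes the HTTP client has already followed redirects before a response is classified.

```python
from urllib.robotparser import RobotFileParser

# Assumption: these status codes are worth retrying later; every other
# 4xx/5xx response is treated as a permanent failure.
TEMPORARY_STATUSES = {408, 429, 500, 502, 503, 504}


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional GET from stored validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def classify_status(status):
    """Map an HTTP status code to 'ok', 'not_modified', 'temporary' or 'permanent'."""
    if 200 <= status < 300:
        return "ok"
    if status == 304:
        # The stored copy is still valid; nothing to download.
        return "not_modified"
    if status in TEMPORARY_STATUSES:
        return "temporary"
    return "permanent"


def allowed_by_robots(robots_txt, url, user_agent):
    """Check a URL against the text of an already-fetched robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A worker would call allowed_by_robots() with the robots.txt of the resource's host and a user agent such as "ckan-downloader" before fetching, send the headers from conditional_headers() with the request, and use classify_status() to decide whether to store the result, skip it, retry later, or mark the resource as broken.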

Optional functionality

  • If the result object is HTML, search it for references to the "proper data" (CSV download pages etc.)
  • Download from POST forms (license acceptance pages or weird proprietary systems)
  • Support running on Google App Engine to save traffic costs.
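
As a rough illustration of the first optional item, the sketch below scans an HTML page for links that look like data files. It is only one possible heuristic: the names DataLinkFinder and find_data_links and the list of file extensions are assumptions made for this example, not part of CKAN.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: file extensions that suggest a link points at "proper data".
DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".json", ".xml", ".zip")


class DataLinkFinder(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in a data extension."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL the page was fetched from.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(DATA_EXTENSIONS):
            self.links.append(absolute)


def find_data_links(html, base_url):
    """Return candidate data URLs found in an HTML document, in page order."""
    finder = DataLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Matching on extensions will miss data served from extension-less URLs, so a real implementation would probably also need to inspect the Content-Type of each candidate link.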

Existing work

Change History

comment:1 Changed 3 years ago by pudo

  • Description modified (diff)

comment:2 Changed 3 years ago by pudo

  • Milestone changed from ckan-v1.3 to iati-4

comment:3 Changed 3 years ago by rgrp

  • Status changed from new to assigned
  • Description modified (diff)
  • Repository set to ckan
  • Theme set to none
  • Milestone set to ckan-v1.6
  • Owner set to kindly