Ticket #1037 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

More Robust Harvesting for DGU

Reported by: thejimmyg
Owned by: amercader
Priority: major
Milestone: ckan-v1.4-sprint-6
Component: uklii
Keywords:
Cc:
Repository: ckan
Theme: none


CKAN's harvesting facility is now live on DGU, but several major improvements would make it more robust and a better fit for the generic CKAN harvesting framework proposed in #987.

Some of the key issues:

  • Error reports do not currently contain the ID or title of the document with the error.
  • Jobs currently log only "added" and "error". We really need a report of "added", "updated", "not changed" and "errors", with each item referencing a real metadata document for which harvesting was attempted.
  • We need to be able to delete and edit sources without deleting the harvested documents or packages.
  • We need a more robust harvesting mechanism than a cron job, or we need to handle the case of multiple cron jobs running at once.
  • We need to know the last time a list of documents was scheduled for harvest and the last time each one was fetched.
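
The reporting points above could be sketched roughly as follows. This is only an illustration in plain Python, not the ckanext-harvest implementation: the names ReportEntry, JobReport and record_document, the in-memory store dict, and the use of a content hash to detect changes are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib


@dataclass
class ReportEntry:
    """One line of the job report, always tied to a real document."""
    guid: str          # ID of the metadata document
    title: str         # title, so an error can be traced to a document
    message: str = ""  # only filled in for errors


@dataclass
class JobReport:
    """Hypothetical per-job report with the four requested categories."""
    added: list = field(default_factory=list)
    updated: list = field(default_factory=list)
    not_changed: list = field(default_factory=list)
    errors: list = field(default_factory=list)


def content_hash(content):
    """Fingerprint of the document body, used to detect changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def record_document(report, store, guid, title, content, now=None):
    """Classify one fetched document and remember when it was fetched.

    `store` is an assumed dict mapping guid -> {"hash", "last_fetched"};
    a real system would keep this in the database. `content` is None
    when the fetch failed.
    """
    now = now or datetime.now(timezone.utc)
    entry = ReportEntry(guid=guid, title=title)
    if content is None:
        entry.message = "document could not be fetched"
        report.errors.append(entry)
        return entry
    digest = content_hash(content)
    previous = store.get(guid)
    if previous is None:
        report.added.append(entry)
    elif previous["hash"] != digest:
        report.updated.append(entry)
    else:
        report.not_changed.append(entry)
    # Record the last successful fetch time for this document.
    store[guid] = {"hash": digest, "last_fetched": now}
    return entry
```

Every entry carries the document ID and title, so an error report can always be traced back to the document that caused it, and the store keeps the last fetch time per document.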

Change History

comment:1 Changed 3 years ago by thejimmyg

  • Repository set to ckan
  • Theme set to none
  • Milestone set to ckan-v1.4-sprint-5

comment:2 Changed 3 years ago by thejimmyg

  • Owner changed from thejimmyg to amercader

comment:3 Changed 3 years ago by thejimmyg

  • Milestone changed from ckan-v1.4-sprint-5 to ckan-v1.4-sprint-6

We spent last week integrating the new harvesting architecture and testing the code, but some areas still need looking at:

  • The source type and label should be part of the plugin, not named in DGU.
  • We need warnings if a document changes but its date doesn't. Do we have these?
  • I noticed there are some tests in DGU; should these perhaps be in ckanext-harvest?
  • If active is False, the job should not be put on the queue
  • If the wrong type of URL is entered, log it as an error the user can see.
  • Refuse to register a source that is already registered.
  • Overwrite all extras rather than just merging in new ones.
  • During the import stage, use iswms.py to add an extra if the document is a WMS, so that we can add a link to the WMS later: https://gist.github.com/900878
  • Can errors/warnings be logged in the import stage? Do all fetched documents get passed to import in one go?
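
Three of the points above (no queueing for inactive sources, refusing duplicate sources, and overwriting extras) might look something like this. It is a rough sketch in plain Python with made-up names (register_source, enqueue_job, overwrite_extras, DuplicateSourceError) and plain dicts and lists standing in for the real models and queue; none of it is taken from ckanext-harvest.

```python
class DuplicateSourceError(ValueError):
    """Raised when a source URL is registered a second time."""


def register_source(sources, url):
    """Add a source to the assumed `sources` dict, keyed by URL.

    Refuses the registration if the same URL is already present.
    """
    key = url.strip()
    if key in sources:
        raise DuplicateSourceError("source already registered: %s" % key)
    sources[key] = {"url": key, "active": True}
    return sources[key]


def enqueue_job(queue, source):
    """Put a job on the queue only if the source is active.

    Returns True if a job was queued, False if it was skipped.
    """
    if not source.get("active", False):
        return False
    queue.append(source["url"])
    return True


def overwrite_extras(package, harvested_extras):
    """Replace all extras with the harvested ones.

    Unlike a merge, keys missing from `harvested_extras` are dropped.
    """
    package["extras"] = dict(harvested_extras)
    return package
```

Replacing the extras wholesale means a key removed from the source document also disappears from the package, which a merge would not achieve.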

comment:4 Changed 3 years ago by thejimmyg

  • Status changed from new to closed
  • state set to draft
  • Resolution set to fixed

Closing this now; any outstanding small issues will be logged in new tickets.
