Ticket #987 (closed defect: duplicate)

Opened 3 years ago

Last modified 3 years ago

Common harvesting framework

Reported by: pudo
Owned by: pudo
Priority: major
Milestone:
Component: lod2
Keywords:
Cc:
Repository: ckan
Theme: none


We are now harvesting metadata from other sources in various places around CKAN. Such harvesting can include:

  • CSW/WFS for INSPIRE/UKLII (yields CKAN packages)
  • Catalogue scraping for LOD2 experiments (yields RDF graphs)
  • Atom/DCat for LOD2 production (yields RDF graphs)
  • OAI-PMH for http://datadryad.org/ and other DSpace repositories (yields CKAN packages)

We should aim to consolidate the harvesting clients into a common system that is easy to extend when needed and can be re-used in different scenarios.

In general, such a system would have the following stages:

  • Source selection: find what to download/scrape/harvest/parse
  • Index retrieval (e.g. package index)
  • Item retrieval (e.g. package entity)
  • (Optional: Serialization)
  • Normalisation
  • Loading/Merging into CKAN
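
The stages above could be sketched roughly as follows. This is only an illustration of the idea, not existing CKAN code: the names `Harvester`, `run_harvest` and `InMemoryHarvester` are hypothetical, the optional serialization stage is left out, and the "load" step is any callable rather than a real CKAN client.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Iterable


class Harvester(ABC):
    """Hypothetical base class: one subclass per source type (CSW, OAI-PMH, ...)."""

    @abstractmethod
    def retrieve_index(self, source: str) -> Iterable[str]:
        """Index retrieval: return the identifiers of the items a source offers."""

    @abstractmethod
    def retrieve_item(self, source: str, item_id: str) -> Any:
        """Item retrieval: fetch one raw record."""

    @abstractmethod
    def normalise(self, raw: Any) -> Dict[str, Any]:
        """Normalisation: map a raw record onto a common dict shape."""


def run_harvest(harvester: Harvester,
                sources: Iterable[str],
                load: Callable[[Dict[str, Any]], None]) -> int:
    """Run all stages for each selected source; return the number of items loaded."""
    loaded = 0
    for source in sources:  # source selection is done by the caller
        for item_id in harvester.retrieve_index(source):
            raw = harvester.retrieve_item(source, item_id)
            load(harvester.normalise(raw))  # loading/merging is delegated to `load`
            loaded += 1
    return loaded


class InMemoryHarvester(Harvester):
    """Toy harvester that reads from a dict instead of the network."""

    def __init__(self, catalogue: Dict[str, Dict[str, Dict[str, str]]]):
        self.catalogue = catalogue

    def retrieve_index(self, source):
        return sorted(self.catalogue[source])

    def retrieve_item(self, source, item_id):
        return self.catalogue[source][item_id]

    def normalise(self, raw):
        return {"name": raw["id"].lower(), "title": raw["Title"].strip()}
```

A new source type would then only need to implement the three methods, while the loop in `run_harvest` stays the same. For example, `run_harvest(InMemoryHarvester(catalogue), ["demo"], store.append)` collects the normalised records into a list `store`.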

Existing harvesters are at:

Change History

comment:1 Changed 3 years ago by thejimmyg

  • Repository set to ckan
  • Status changed from new to closed
  • Theme set to none
  • Resolution set to duplicate

This has now largely been implemented for publicdata.eu. CREP #1134 is still outstanding to take the harvesting to the next level, so I am marking this one as a duplicate for now.
