Ticket #987 (closed defect: duplicate)
Common harvesting framework
Reported by: | pudo | Owned by: | pudo |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | lod2 | Keywords: | |
Cc: | | Repository: | ckan |
Theme: | none | | |
Description
We are now harvesting metadata from other sources in various places around CKAN. Such harvesting can include:
- CSW/WFS for INSPIRE/UKLII (yields CKAN packages)
- Catalogue scraping for LOD2 experiments (yields RDF graphs)
- Atom/DCat for LOD2 production (yields RDF graphs)
- OAI-PMH for http://datadryad.org/ and other DSpace instances (yields CKAN packages)
We should aim to consolidate the harvesting clients into a common system that is easy to extend when needed and can be re-used in different scenarios.
In general, such a system would have the following stages:
- Source selection: find what to download/scrape/harvest/parse
- Index retrieval (e.g. package index)
- Item retrieval (e.g. package entity)
- (Optional: Serialization)
- Normalisation
- Loading/Merging? into CKAN
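The stages above could be sketched roughly as follows. This is only an illustration under assumptions, not the design of any existing CKAN harvester: the names `Harvester`, `run_harvest`, and `InMemoryHarvester` are hypothetical, the optional serialization stage is left out, and the `load` callback stands in for whatever would actually load or merge records into CKAN.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Iterable, List


class Harvester(ABC):
    """Hypothetical base class: one subclass per source type (CSW, OAI-PMH, ...)."""

    @abstractmethod
    def fetch_index(self, source: str) -> Iterable[str]:
        """Index retrieval: return the identifiers of the items at a source."""

    @abstractmethod
    def fetch_item(self, source: str, item_id: str) -> Any:
        """Item retrieval: return the raw record for one identifier."""

    @abstractmethod
    def normalise(self, raw: Any) -> Dict[str, Any]:
        """Normalisation: map a raw record onto a common dict shape."""


def run_harvest(
    harvester: Harvester,
    sources: List[str],
    load: Callable[[Dict[str, Any]], None],
) -> int:
    """Run every stage for each selected source; return the number of items loaded."""
    loaded = 0
    for source in sources:  # source selection is decided by the caller
        for item_id in harvester.fetch_index(source):
            raw = harvester.fetch_item(source, item_id)
            load(harvester.normalise(raw))  # loading/merging is delegated
            loaded += 1
    return loaded


class InMemoryHarvester(Harvester):
    """Toy harvester over a dict, standing in for a real remote catalogue."""

    def __init__(self, catalogue: Dict[str, Dict[str, Dict[str, str]]]) -> None:
        self.catalogue = catalogue

    def fetch_index(self, source: str) -> Iterable[str]:
        return sorted(self.catalogue[source])

    def fetch_item(self, source: str, item_id: str) -> Dict[str, str]:
        return self.catalogue[source][item_id]

    def normalise(self, raw: Dict[str, str]) -> Dict[str, Any]:
        return {"name": raw["id"].lower(), "title": raw["Title"].strip()}


if __name__ == "__main__":
    demo = {"demo": {"A1": {"id": "A1", "Title": " First dataset "}}}
    packages: List[Dict[str, Any]] = []
    run_harvest(InMemoryHarvester(demo), ["demo"], packages.append)
    print(packages)
```

Keeping retrieval and normalisation behind one interface means a new source type would only need a new subclass, while the loading step can be swapped (CKAN packages in one scenario, RDF graphs in another) without touching the harvesters.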
Existing harvesters are at:
Change History
This has largely been implemented now for publicdata.eu. CREP #1134 is still outstanding to take the harvesting to the next level, so I am marking this one as a duplicate for now.