Ticket #1134 (new CREP)

Opened 3 years ago

Last modified 23 months ago

CREP0003: Description and Configuration of Harvesters

Reported by: amercader
Owned by:
Priority: major
Milestone: ckan-backlog
Component: ckan
Keywords: Harvesting
Cc: david.raznick@…
Repository: ckanext-harvest
Theme: none


Proposer: Adrià Mercader


The new harvester interface makes it possible to create harvesters for different sources, but right now harvesters have few ways to describe and configure themselves. We need a way of allowing them to:

  • Expose their type and other details so they can be used internally and on the UI.
  • Define configuration settings for particular harvester instances.

The Problem

Harvester description

The current UI for adding and editing harvest sources is the same one used in ckanext-dgu, so the three harvester types that DGU uses to harvest various GEMINI-related sources are hardcoded in the form. The form will be migrated to a DGU-independent one, so the harvesters need to provide all the necessary data themselves. There is currently a get_type method that returns the harvester type, but to stay compatible with the DGU forms it returns a display string (e.g. "CSW Server") rather than a machine-readable identifier, which makes it error prone.

Arbitrary configuration

In the current implementation, when the harvest process starts, ckanext-harvest looks for all the available plugins that implement the IHarvester interface and calls the appropriate method for the current stage (gather_stage, fetch_stage, import_stage). At these stages, harvesters have no way of applying arbitrary configuration options, so all harvesters of the same type behave in the same way. For instance, the CKAN harvester needs a way to define the API version to use when harvesting remote instances (right now, version 2 is hardcoded in the code).


Harvester description

Harvesters will need to provide the following information so the UI form can be built:

  • name: machine-readable name (e.g. "waf"). This will be the value stored in the database, and the one used by ckanext-harvest to call the appropriate harvester.
  • title: human-readable name (e.g. "Web Accessible Folder (WAF)"). This will appear in the form's select box.
  • description: a description of what the harvester does (e.g. "A Web Accessible Folder (WAF) displaying a list of GEMINI 2.1 documents"). This will appear on the form as guidance for the user.

The way to provide it will be an info method that all harvesters must implement, which will return a dictionary with the previous elements:

        {
            'name': 'csw',
            'title': 'CSW Server',
            'description': "A server that implements OGC's Catalog Service "
                           "for the Web (CSW) standard"
        }
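
As a rough illustration of how the form could consume this, the sketch below collects info() from a list of harvesters and builds the entries for the select box. The two classes and the build_type_options helper are hypothetical names made up for this example and are not part of ckanext-harvest; only the info method and its three keys come from the proposal.

```python
class CSWHarvester(object):
    """Hypothetical harvester, used only for illustration."""

    def info(self):
        return {
            'name': 'csw',
            'title': 'CSW Server',
            'description': "A server that implements OGC's Catalog Service "
                           "for the Web (CSW) standard",
        }


class WAFHarvester(object):
    """Hypothetical harvester, used only for illustration."""

    def info(self):
        return {
            'name': 'waf',
            'title': 'Web Accessible Folder (WAF)',
            'description': 'A Web Accessible Folder (WAF) displaying a list '
                           'of GEMINI 2.1 documents',
        }


def build_type_options(harvesters):
    """Return (value, label, help_text) tuples for the form's select box."""
    options = []
    for harvester in harvesters:
        info = harvester.info()
        # name is what gets stored, title is what the user sees.
        options.append(
            (info['name'], info['title'], info.get('description', '')))
    return options


options = build_type_options([CSWHarvester(), WAFHarvester()])
```

With this shape the form no longer needs to know about specific harvester types: adding a new harvester only means adding a class that returns its own info dictionary.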

Arbitrary configuration

As different harvesters will have very different needs, we need a way to persist arbitrary configuration flags for each harvest source. In my opinion, the most flexible approach given the current architecture is to store the configuration options as a JSON encoded object in a property of the harvest source. There is already an unused DB field called config in the database (maybe using JsonType?).

This means adding an extra field to the harvest source form for entering the configuration. It could be a simple text field where users enter the JSON encoded object, or a more clever mechanism (e.g. an "Add a configuration flag" link that adds two new text fields, for the key and value of each flag, plus a mechanism to later build the JSON object). In any case, this should probably be hidden in an "Advanced options" section.
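
The "build the JSON object later" step of the key/value variant could look roughly like the sketch below. The build_config name, the decision to skip pairs with an empty key, and returning None when no flag is left are all assumptions made for this example, not agreed behaviour.

```python
import json


def build_config(pairs):
    """Turn (key, value) pairs from the form into a JSON encoded string.

    Pairs with an empty key are skipped. Returns None when no flag is
    left, so that nothing needs to be stored in the config field.
    """
    config = {}
    for key, value in pairs:
        key = key.strip()
        if key:
            config[key] = value.strip()
    if not config:
        return None
    # sort_keys keeps the stored string stable between edits.
    return json.dumps(config, sort_keys=True)


encoded = build_config([('api_version', '2'), ('', 'ignored')])
```

Note that values coming from plain text fields stay strings in this sketch, so a harvester that expects a number would have to convert it itself.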

Why do it this way

Harvester description

The info method would provide a single point to get all the information related to the harvester, and future properties could be added to the dictionary returned without having to modify the interface.

Arbitrary configuration

There is already a config field in the database, so we won't need to change the model. Harvesters could access the config object at any of the stages. Of course, they could provide default values in their implementations so users don't need to enter them every time.
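
A minimal sketch of how a harvester might combine its own defaults with the stored configuration is shown below. DEFAULT_CONFIG, load_config and api_version_for are hypothetical names, and the functions take the raw JSON string directly instead of a real harvest job or source object to keep the example self-contained.

```python
import json

# Hypothetical defaults for the CKAN harvester. The proposal only states
# that API version 2 is currently hardcoded.
DEFAULT_CONFIG = {'api_version': 2}


def load_config(raw_config, defaults=DEFAULT_CONFIG):
    """Merge the JSON stored in the config field over the defaults."""
    config = dict(defaults)  # copy, so the defaults are never mutated
    if raw_config:
        stored = json.loads(raw_config)
        if not isinstance(stored, dict):
            raise ValueError('config must be a JSON object')
        config.update(stored)
    return config


def api_version_for(raw_config):
    """Stand-in for code inside a stage that needs a config option."""
    return load_config(raw_config)['api_version']
```

A source with an empty config field then behaves as it does today, while a source storing {"api_version": 1} overrides only that option.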

Implementation plan


Risks and mitigations

The highest risk on the info method side is that a harvester implementation does not provide one of the required properties (namely name and title). This could trigger a warning when showing the UI form or using the CLI.
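
One possible shape for that check is sketched below, using the standard logging module. The helper names are hypothetical, and treating an empty string the same as a missing key is an assumption of this example.

```python
import logging

log = logging.getLogger(__name__)

# Properties the proposal treats as necessary.
REQUIRED_KEYS = ('name', 'title')


def missing_info_keys(info):
    """Return the required keys that are absent or empty in an info dict."""
    return [key for key in REQUIRED_KEYS if not info.get(key)]


def check_harvester_info(info):
    """Log a warning for incomplete info; return True when it is usable."""
    missing = missing_info_keys(info)
    if missing:
        log.warning('Harvester info is missing: %s', ', '.join(missing))
        return False
    return True
```

The form or the CLI could call check_harvester_info for each harvester and leave out the ones that fail instead of breaking.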


Adrià Mercader to do it.


None yet.

Change History

comment:1 Changed 3 years ago by thejimmyg

  • Cc david.raznick@… added
  • Priority changed from awaiting triage to major
  • state changed from draft to accepted

Great stuff.

Yes, I agree with all this, but as initial thoughts I suggest that for phase 1, the info() method returns one more key called "form_config_interface", which takes a string representing the type of interface for the config field on the form. If the key is missing it is treated as "None".

The two possible values are:

  • None - the field is not present and must always be stored as NULL
  • Text - a single text field will be provided on the interface

Whatever the interface, the value built will always be stored in the config column of the table.
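
To make the rule concrete, here is a small sketch of how the two values could be resolved. The function names are hypothetical, and raising an error for an unknown value is an assumption; the suggestion above only defines "None" and "Text".

```python
def config_interface(info):
    """Return the interface type, treating a missing key as 'None'."""
    return info.get('form_config_interface', 'None')


def config_value_to_store(info, submitted_text):
    """Return what would go in the config column for a submitted form."""
    interface = config_interface(info)
    if interface == 'None':
        return None  # no field on the form, so the column stays NULL
    if interface == 'Text':
        return submitted_text
    raise ValueError('Unknown form_config_interface: %r' % interface)
```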

We then also provide a "get_schema()" method that returns a schema capable of parsing the data submitted by the form and storing it as a single key named "config". ckanext-inspire will then add other functions to the schema for URL etc. so that, regardless of what the harvest plugin does, the key ckanext-harvest fields are always there.

This then allows:

  • Customisable config interfaces
  • Customisable validation
  • A consistent place to store config in the database

Sound good?

comment:2 Changed 3 years ago by sebbacon

Agreed with James that we should consider *from the start* how to provide for a nice UI -- my worry is if we start out with people editing raw JSON we'll end up keeping that for years :)

comment:3 Changed 23 months ago by seanh

  • Milestone set to ckan-backlog