<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>CKAN: Ticket Query</title>
    <link>http://localhost/query?status=!closed&amp;keywords=~harvest&amp;order=keywords</link>
    <description>The open source data portal software</description>
    <language>en-US</language>
    <image>
      <title>CKAN</title>
      <url>http://assets.okfn.org/p/ckan/img/ckan_logo_shortname.png</url>
      <link>http://localhost/query?status=!closed&amp;keywords=~harvest&amp;order=keywords</link>
    </image>
    <generator>Trac 0.12.3</generator>
    <item>
        <link>http://localhost/ticket/1134</link>
        <guid isPermaLink="false">http://localhost/ticket/1134</guid>
        <title>#1134: CREP0003: Description and Configuration of Harvesters</title>
        <pubDate>Wed, 11 May 2011 10:14:28 GMT</pubDate>
        
        <dc:creator>amercader</dc:creator>

        <description>&lt;p&gt;
&lt;strong&gt;Proposer&lt;/strong&gt;: Adrià Mercader
&lt;/p&gt;
&lt;h2 id="Abstract"&gt;Abstract&lt;/h2&gt;
&lt;p&gt;
The new harvester interface allows to create harvesters for different
sources, but right now harvesters don't have many ways to describe and
configure themselves. We need a way of allowing them to:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;Expose their type and other details so they can be used internally
and on the UI.
&lt;/li&gt;&lt;li&gt;Define configuration settings for particular harvester instances.
&lt;/li&gt;&lt;/ul&gt;&lt;h2 id="TheProblem"&gt;The Problem&lt;/h2&gt;
&lt;h3 id="Harvesterdescription"&gt;Harvester description&lt;/h3&gt;
&lt;p&gt;
The current UI for adding and editing harvest sources is the same used
in ckanext-dgu, and thus the 3 harvester types used in DGU to harvest
various GEMINI realted sources are hardcoded in the form. The form will
be migrated to a DGU-independent one, so we need the harvesters to
provide all the necessary data. There is a current &lt;tt&gt;get_type&lt;/tt&gt; method
that returns the harvester type, but for make it compatible with the DGU
forms, it returns a machine-readable string (e.g. "CSW Server"), making
it error prone.
&lt;/p&gt;
&lt;h3 id="Arbitraryconfiguration"&gt;Arbitrary configuration&lt;/h3&gt;
&lt;p&gt;
In the current implementation, when the harvest process is started,
ckanext-harvest looks for all the available plugins that implement the
&lt;tt&gt;IHarvester&lt;/tt&gt; interface and calls the appropiate methods for the
current stage (&lt;tt&gt;gather_stage&lt;/tt&gt;,&lt;tt&gt;fetch_stage&lt;/tt&gt;,&lt;tt&gt;import_stage&lt;/tt&gt;).
At these stages, harvesters have no way of applying arbitrary
configuration options, so all harvesters of the same type behave on the
same way.
For instance, the CKAN harvester needs a way to define the API version
to use when harvesting remote instances (Right now, the version 2 is
hardcoded on the code).
&lt;/p&gt;
&lt;h2 id="Specification"&gt;Specification&lt;/h2&gt;
&lt;h3 id="Harvesterdescription1"&gt;Harvester description&lt;/h3&gt;
&lt;p&gt;
Harvesters will need to provide the following information so the UI form
can be built:
&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;name: machine-readable name (e.g. "waf"). This will be the value
stored in the database, and the one used by ckanext-harvest to
call the appropiate harvester.
&lt;/li&gt;&lt;li&gt;title: human-readable name (e.g. "Web Accessible Folder (WAF)").
This will appear in the form's select box.
&lt;/li&gt;&lt;li&gt;description: a description of what the harvester does (e.g. "A Web
Accessible Folder (WAF) displaying a list of GEMINI 2.1
documents"). This will appear on the form as a guidance to the
user.
&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
The way to provide it will be an &lt;tt&gt;info&lt;/tt&gt; method that all harvesters
must implement, which will return a dictionary with the previous
elements:
&lt;/p&gt;
&lt;pre class="wiki"&gt;    {
        'name': 'csw',
        'title': 'CSW Server',
        'description': 'A server that implements OGC's Catalog Service
                        for the Web (CSW) standard'
    }
&lt;/pre&gt;&lt;h3 id="Arbitraryconfiguration1"&gt;Arbitrary configuration&lt;/h3&gt;
&lt;p&gt;
As different harvesters will have very different needs, we need to
provide a way to persist arbitrary configuration flags for each harvest
source. The more flexible way given the current architecture in my
opinion would be to store the configuration options as a JSON encoded
object as a property of the harvest source (There already is an unused
DB field called &lt;tt&gt;config&lt;/tt&gt; in the database) (Maybe using &lt;a class="missing wiki"&gt;JsonType?&lt;/a&gt;?).
&lt;/p&gt;
&lt;p&gt;
This will mean adding an extra field in the harvest source form to allow
entering the configuration. This could be just a simple text field where
users enter the JSON encoded object or a more clever mechanism (i.e an
"Add a configuration flag" link that adds two new text fields for the
key and value for each flag, and a mechanism to later build the JSON
object). In any case, this should probably be hidden in an "Advance
options" section.
&lt;/p&gt;
&lt;h2 id="Whydoitthisway"&gt;Why do it this way&lt;/h2&gt;
&lt;h3 id="Harvesterdescription2"&gt;Harvester description&lt;/h3&gt;
&lt;p&gt;
The &lt;tt&gt;info&lt;/tt&gt; method would provide a single point to get all the
information related to the harvester, and future properties could be
added to the dictionary returned without having to modify the interface.
&lt;/p&gt;
&lt;h3 id="Arbitraryconfiguration2"&gt;Arbitrary configuration&lt;/h3&gt;
&lt;p&gt;
There is an already existing &lt;tt&gt;config&lt;/tt&gt; field in the database, so we
won't need to change the model.
Harvesters could access the config object at any of the stages. Of
course they could provide default values in their implementations so
users don't need to enter them everytime.
&lt;/p&gt;
&lt;h2 id="Implementationplan"&gt;Implementation plan&lt;/h2&gt;
&lt;h3 id="Deliverables"&gt;Deliverables&lt;/h3&gt;
&lt;h3 id="Risksandmitigations"&gt;Risks and mitigations&lt;/h3&gt;
&lt;p&gt;
The highest risk on the harvesters &lt;tt&gt;info&lt;/tt&gt; method side is that
harvester implementation don't offer one of the necessary properties
(namely name and title). This could fire a warning when showing the
UI form or using the CLI.
&lt;/p&gt;
&lt;h3 id="Participants"&gt;Participants&lt;/h3&gt;
&lt;p&gt;
Adrià Mercader to do it.
&lt;/p&gt;
&lt;h3 id="Progress"&gt;Progress&lt;/h3&gt;
&lt;p&gt;
None yet.
&lt;/p&gt;
</description>
        <category>Results</category>
        <comments>http://localhost/ticket/1134#changelog</comments>
    </item>
 </channel>
</rss>