Version 2 (modified by dread, 4 years ago)
SOLR Interface
Use Cases
As an admin of a CKAN instance I want to have the facility to use SOLR, both alongside and as a replacement for the existing full text search in CKAN.
Design
There are two main options for getting data into SOLR:
- POST the records to SOLR in XML format (docs)
- Direct connection from SOLR to the database:
  - Set up SOLR (docs)
  - Provide SELECT statements to do the queries
  - The process is initiated by doing a GET to a particular SOLR URL
The first option is preferred: the abstraction leaves more flexibility to change the database and gives more control over what gets indexed.
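As a rough illustration of the POST option, the sketch below builds a SOLR update XML message from package dictionaries and sends it over HTTP. It uses only the Python standard library rather than any of the SOLR client libraries. The URL, the function names and the sample field values are assumptions for illustration, not part of CKAN.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed URL of a local SOLR instance; adjust to the actual deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(packages):
    """Build a SOLR <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for package in packages:
        doc = ET.SubElement(add, "doc")
        for field_name, value in package.items():
            field = ET.SubElement(doc, "field", name=field_name)
            field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes, url=SOLR_UPDATE_URL):
    """POST an update message to SOLR and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    message = build_add_xml([{"name": "annakarenina", "title": "A Novel By Tolstoy"}])
    print(message.decode("utf-8"))
    # post_to_solr(message)       # requires a running SOLR instance
    # post_to_solr(b"<commit/>")  # make the added documents searchable
```

Because CKAN decides which fields go into each doc element, the database schema can change without touching the SOLR configuration, which is the flexibility referred to above.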
When should a package be indexed? Currently we index it on the database after_insert and after_update triggers. However, this could seriously slow down a large data import, since indexing each package requires a POST over the internet. One option is to keep the triggers but turn them off for a batch import and then run the indexing manually. Alternatively, store up the changes and index them in an hourly cron job.
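The trade-off between indexing on every change and indexing in batches can be sketched as follows. This is a minimal illustration only: the class name, the immediate flag and the index_batch callable are hypothetical and do not describe existing CKAN code.

```python
class DeferredIndexer:
    """Index packages immediately, or queue them for a later batch run."""

    def __init__(self, index_batch, immediate=True):
        # index_batch: callable that takes a list of package names and
        # sends them to the search index (for example, one POST to SOLR).
        self.index_batch = index_batch
        # immediate=True mimics the after_insert / after_update triggers.
        self.immediate = immediate
        self.pending = set()

    def notify(self, package_name):
        """Record that a package was inserted or updated."""
        if self.immediate:
            self.index_batch([package_name])
        else:
            # A set ensures a package changed twice is indexed only once.
            self.pending.add(package_name)

    def flush(self):
        """Index all queued packages in one batch; return how many were sent."""
        if not self.pending:
            return 0
        batch = sorted(self.pending)
        self.index_batch(batch)
        self.pending.clear()
        return len(batch)


if __name__ == "__main__":
    indexer = DeferredIndexer(print, immediate=False)
    for name in ["annakarenina", "warandpeace", "annakarenina"]:
        indexer.notify(name)
    indexer.flush()  # one batch instead of three separate POSTs
```

A batch import would set immediate to False and call flush() once at the end. The hourly cron variant would call flush() on a schedule, which additionally requires storing the pending set somewhere persistent.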
Tickets
- Get a SOLR instance running locally and/or on eu1, using a basic config.
- Get indexing and searching working with name and title fields only:
  - Harness one of the three Python SOLR libraries to send SOLR Update XML of CKAN Packages (triggered on the command line).
  - Write tests for SOLR by sending data with the SOLR library and using the JSON interface for queries.
- Get it working with all package fields, optimising the field descriptions in schema.xml.
- Trigger the indexing sensibly (as decided above).
- Provide option to connect CKAN's search WUI to SOLR back-end.
- Developer docs: describe how to set up SOLR and link to schema.xml.
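For the testing ticket above, queries through the JSON interface could look roughly like the sketch below. It builds a select URL with wt=json and extracts package names from a response body. The base URL and the helper names are assumptions for illustration, and the sample response is hand-written in the shape SOLR returns, not captured output.

```python
import json
from urllib.parse import urlencode

# Assumed URL of a local SOLR instance; adjust to the actual deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, rows=10, base_url=SOLR_SELECT_URL):
    """Return a SOLR select URL that asks for a JSON response."""
    params = {"q": query, "wt": "json", "rows": rows}
    return base_url + "?" + urlencode(params)


def extract_names(response_text):
    """Return the 'name' field of each document in a SOLR JSON response."""
    payload = json.loads(response_text)
    return [doc["name"] for doc in payload["response"]["docs"]]


if __name__ == "__main__":
    print(build_select_url("title:tolstoy", rows=5))
    # Hand-written sample in the shape of a SOLR JSON response.
    sample = json.dumps({
        "responseHeader": {"status": 0},
        "response": {
            "numFound": 1,
            "start": 0,
            "docs": [{"name": "annakarenina", "title": "A Novel By Tolstoy"}],
        },
    })
    print(extract_names(sample))
```

A test would index a known package, fetch the URL from build_select_url, and assert that extract_names returns the expected package name.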