= CKAN Search Engine =

== Use Cases ==
As a user of a CKAN instance, I want to be able to run complex searches that reference individual data fields.

== Design ==

Search technology: Apache SOLR has been selected.

Architecture: SOLR is to work both alongside, and as a replacement for, the existing full-text search in CKAN.

There are two main options for getting data into SOLR:

 * POST the records to SOLR in XML format ([http://wiki.apache.org/solr/UpdateXmlMessages docs])
 * Set up SOLR with a direct connection to the database ([http://wiki.apache.org/solr/DataImportHandler docs])
   * Provide SELECT statements to do the queries
   * The import is initiated by doing a GET to a particular SOLR URL

The first option is preferred, as the abstraction provides more flexibility in the database and more control over what gets indexed.
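
As an illustration of the first option, here is a minimal Python sketch that builds a SOLR Update XML `<add>` message from package dictionaries and POSTs it. The function names, the dictionary layout of a package, and the update URL are assumptions made for this example and are not taken from CKAN; it uses only the standard library rather than one of the SOLR client libraries.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(packages):
    """Return a SOLR <add> update message (as bytes) for a list of package dicts.

    Each dict maps a field name (hypothetical, e.g. "name", "title") to a value.
    """
    add = ET.Element("add")
    for package in packages:
        doc = ET.SubElement(add, "doc")
        for field_name, value in package.items():
            if value is None:
                continue  # leave out empty fields instead of indexing "None"
            field = ET.SubElement(doc, "field", name=field_name)
            field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def post_update(solr_update_url, xml_bytes):
    """POST an update message to SOLR and return the HTTP status code.

    solr_update_url is an assumption, e.g. "http://localhost:8983/solr/update".
    """
    request = urllib.request.Request(
        solr_update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Because the records are assembled in CKAN code, this is where the control over what gets indexed comes from: only the fields placed in each dictionary are sent. Added documents become searchable only after a `<commit/>` message has also been posted to the same URL.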

When should a package be indexed? Currently we index it on the database after_insert and after_update triggers. However, this might seriously slow down a large data import, since each indexing operation requires a POST over the internet. One option is to keep the triggers but turn them off for a batch import and then run the indexing manually. Alternatively, store up the changes and index them from an hourly cron job.
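
The batch and cron alternatives could both be served by one small queue that the triggers write to. The sketch below is a rough illustration only: `IndexQueue`, `package_changed` and `flush` are hypothetical names, and `index_packages` stands in for whatever function actually sends packages to SOLR.

```python
class IndexQueue:
    """Collects the names of changed packages and indexes them in batches."""

    def __init__(self, index_packages, immediate=True):
        # index_packages: callable taking a list of package names (hypothetical).
        self._index_packages = index_packages
        # immediate=True mimics the current behaviour: index on every change.
        self.immediate = immediate
        self._pending = []

    def package_changed(self, name):
        """Intended to be called from the after_insert / after_update triggers."""
        if name not in self._pending:
            self._pending.append(name)  # a package changed twice is indexed once
        if self.immediate:
            self.flush()

    def flush(self):
        """Index everything pending in a single call; return how many were sent."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        self._index_packages(batch)
        return len(batch)
```

For a batch import, `immediate` would be set to `False` and `flush()` called once at the end; for the cron approach, the pending names would need to be stored somewhere that survives between processes, which this in-memory sketch does not attempt.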

== Tickets ==

 1. Get a SOLR instance running, using a basic config.
 2. Get indexing and searching working with the name and title fields only:
   * Harness one of the three Python SOLR libraries to send SOLR Update XML of CKAN Packages (triggered on the command line).
   * Write tests for SOLR by sending data with the SOLR library and using the JSON interface for queries.
 3. Get it working with all package fields, optimising the field descriptions in schema.xml.
 4. Trigger the indexing sensibly (as decided above).
 5. Provide an option to connect CKAN's search WUI to the SOLR back-end.
 6. CKAN developer docs: a description of how to set up the SOLR link and schema.xml.
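
For the tests in ticket 2, querying through the JSON interface might look roughly like the Python sketch below. The base URL and the helper names are assumptions for illustration; the `wt=json` parameter and the `response` / `numFound` / `docs` layout are SOLR's standard JSON response format.

```python
import json
import urllib.parse
import urllib.request


def build_select_url(solr_base_url, field, term, rows=10):
    """Build a SOLR select URL that searches one field and asks for JSON."""
    params = urllib.parse.urlencode(
        {"q": f"{field}:{term}", "wt": "json", "rows": rows}
    )
    return f"{solr_base_url.rstrip('/')}/select?{params}"


def parse_select_response(body):
    """Return (number of matches, list of matching documents) from a JSON body."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


def search(solr_base_url, field, term):
    """Run the query against a live SOLR instance (needs network access)."""
    url = build_select_url(solr_base_url, field, term)
    with urllib.request.urlopen(url) as http_response:
        return parse_select_response(http_response.read().decode("utf-8"))
```

A test could index a known package, call `search("http://localhost:8983/solr", "name", ...)` with that package's name, and assert on the returned count. Note that the search term is passed through as-is, so characters with special meaning in the SOLR query syntax would need escaping first.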