Ticket #1430 (closed defect: fixed)
Documents get mixed between SOLR cores
Reported by: | amercader | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | ckan-sprint-2011-11-07 |
Component: | ckan | Keywords: | search |
Cc: | johnglover, thejimmyg, kindly, rgrp, nils.toedtmann@… | Repository: | ckan |
Theme: | none |
Description
On some occasions (apparently random), the documents indexed in a specific SOLR core get mixed with different site_ids.
E.g: We look for all documents in the testing.iatiregistry.org core, faceted by site_id. We would expect all documents to have site_id = iati_testing, but some of them have site_id = iatiregistry.org
<lst name="facet_fields">
  <lst name="site_id">
    <int name="iati_testing">265</int>
    <int name="iatiregistry.org">255</int>
  </lst>
</lst>
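For anyone wanting to repeat this check, here is a minimal sketch of the faceted query and of reading the counts out of the XML response. The core URL is hypothetical, the `<response>` wrapper in the sample is only there to make it parseable, and the helper names are made up for illustration; only the standard SOLR parameters `q`, `rows`, `facet` and `facet.field` are relied on.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def facet_query_url(core_url, field="site_id"):
    """Build a select URL that returns only facet counts for `field`."""
    params = {"q": "*:*", "rows": 0, "facet": "true", "facet.field": field}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def facet_counts(response_xml, field="site_id"):
    """Return {facet value: document count} from a SOLR XML response."""
    root = ET.fromstring(response_xml)
    counts = {}
    for lst in root.iter("lst"):
        if lst.get("name") == field:
            for item in lst.findall("int"):
                counts[item.get("name")] = int(item.text)
    return counts


# Sample built from the facet output quoted in this ticket.
SAMPLE = """<response><lst name="facet_counts">
<lst name="facet_fields">
  <lst name="site_id">
    <int name="iati_testing">265</int>
    <int name="iatiregistry.org">255</int>
  </lst>
</lst>
</lst></response>"""

if __name__ == "__main__":
    # Hypothetical host and port; adjust to the actual SOLR instance.
    print(facet_query_url("http://localhost:8983/solr/testing.iatiregistry.org"))
    print(facet_counts(SAMPLE))
```

A core that is not affected should report a single site_id here; more than one key in the result is the symptom described above.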
If we compare one of the records which disappeared from the "iati_testing" site_id in both the production and testing SOLR cores of the server, the records are exactly the same, including the indexed_ts property:
Note that the response from the URLs shown may vary, as the testing site could have been reindexed.
Attachments
Change History
Changed 2 years ago by amercader
- Attachment jetty.requests.log added
Changed 2 years ago by amercader
- Attachment jetty.stderrout.log added
SOLR output after editing a dataset
comment:1 Changed 2 years ago by amercader
- Cc rgrp added
I've been digging more into this one. To reproduce it, you just have to edit the same dataset in both sites (production and testing). Right after the edit, the search index contains mixed site_ids.
I checked the jetty logs (see attached files) and just after editing a dataset there are two POST requests to update the index. The request logs don't show the request params, so it's hard to tell what the second call does (it is probably the commit): https://bitbucket.org/okfn/ckan/src/97e1e90d66d7/ckan/lib/search/index.py#cl-144
In any case, the problem seems to be related to the datasets in the two cores sharing the same id. We are currently using the dataset id as the uniqueKey in SOLR, i.e. in our schema.xml we define:
<uniqueKey>id</uniqueKey>
According to the SOLR docs: "If a document is added that contains the same value for this field as an existing document, the old document will be deleted."
http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field
I would expect uniqueKey values not to clash across cores, but that seems to be what is happening. Maybe we should generate a solr_id specific to each document for each site, as described here: http://wiki.apache.org/solr/UniqueKey#UUID_techniques
(Note that apart from the testing/production site use case, at some point sites involved in harvesting could also end up with datasets with the same id.)
Again, I'm not a SOLR expert, so the problem could be a completely different one!
comment:3 Changed 2 years ago by amercader
Right, more news on this front.
I've tested a patch which uses a hash of the dataset id and site_id to produce a unique id, and then configured the iati solr cores to use index_id as uniqueKey: https://bitbucket.org/okfn/ckan/changeset/855f5a452f60
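As a rough illustration of the idea (not the code from the linked changeset), a per-site unique id could be derived as below. The function name, the choice of MD5 and the `|` separator are all assumptions made for this sketch.

```python
import hashlib


def make_index_id(dataset_id, site_id):
    """Derive an id that is unique per (dataset, site) pair.

    Assumption: MD5 over "dataset_id|site_id". The separator keeps pairs
    such as ("ab", "c") and ("a", "bc") from hashing to the same value.
    """
    raw = "{0}|{1}".format(dataset_id, site_id)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical dataset id, indexed from both sites.
    dataset_id = "example-dataset-id"
    print(make_index_id(dataset_id, "iati_testing"))
    print(make_index_id(dataset_id, "iatiregistry.org"))
```

With ids like these, the same dataset indexed from two sites yields two distinct documents instead of one replacing the other, which matches the duplication reported below.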
Unfortunately, that did not solve the issue. Again, updating the same dataset in both apps messes things up. In this case, documents don't get replaced, but duplicated, so the new uniqueKey is working.
I am more inclined to think that this is caused by a misconfiguration in the SOLR instance in s046. This is the file where the two cores are configured:
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="testing.iatiregistry.org" instanceDir="testing.iatiregistry.org">
      <property name="dataDir" value="/usr/share/solr/testing.iatiregistry.org/data" />
    </core>
    <core name="iatiregistry.org" instanceDir="iatiregistry.org">
      <property name="dataDir" value="/usr/share/solr/iatiregistry.org/data" />
    </core>
  </cores>
</solr>
Following these paths: /usr/share/solr/iatiregistry.org is a symlink to /etc/solr/iatiregistry.org, and /etc/solr/iatiregistry.org/data is empty (as is the testing equivalent). On the other hand, judging by the admin interface and some errors that I got, the data folder that both cores are actually using seems to be /var/lib/solr/data/index
Maybe that's the problem?
comment:4 Changed 2 years ago by pudo
You probably know this by now, but the solrconfig.xml for both cores points to /var/lib/solr/data (line 71).
comment:5 Changed 2 years ago by amercader
No, I didn't know it. Is this supposed to be the correct setting or should each core have its own data dir?
comment:6 Changed 2 years ago by rgrp
@pudo: brilliant spot. How did that misconfig end up there? Anyway, commenting it out and rebooting means we now have separate cores, which seems to fix it.
comment:7 Changed 2 years ago by amercader
- Status changed from new to closed
- Resolution set to fixed
As rgrp mentioned, we commented out the <dataDir> directive in the solrconfig.xml files and rebooted. That made the cores use the data dirs they were supposed to (the ones in solr.xml), and from the tests I made it looks like this finally fixed the issue.
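To catch this kind of misconfiguration earlier, a small check along these lines could flag cores whose solrconfig.xml files hard-code the same <dataDir>. This is only a sketch: the function names are made up, the sample configs are reduced to the one relevant element, and `${...}` property substitution is not resolved.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def configured_data_dir(solrconfig_xml):
    """Return the <dataDir> set in a solrconfig.xml string, or None.

    A commented-out directive counts as not set, since the XML parser
    drops comments.
    """
    node = ET.fromstring(solrconfig_xml).find("dataDir")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


def shared_data_dirs(solrconfigs):
    """Map each data dir used by more than one core to those core names.

    `solrconfigs` maps core name -> contents of its solrconfig.xml.
    """
    by_dir = defaultdict(list)
    for core, xml_text in solrconfigs.items():
        data_dir = configured_data_dir(xml_text)
        if data_dir is not None:
            by_dir[data_dir].append(core)
    return {d: sorted(c) for d, c in by_dir.items() if len(c) > 1}


# Simplified stand-ins for the real files, before and after the fix.
BEFORE = "<config><dataDir>/var/lib/solr/data</dataDir></config>"
AFTER = "<config><!-- <dataDir>/var/lib/solr/data</dataDir> --></config>"

if __name__ == "__main__":
    cores = ["testing.iatiregistry.org", "iatiregistry.org"]
    print(shared_data_dirs({c: BEFORE for c in cores}))
    print(shared_data_dirs({c: AFTER for c in cores}))
```

An empty result means no two cores hard-code the same directory, so each core falls back to the dataDir defined for it in solr.xml, as described above.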
Requests received after editing a dataset