Ticket #1430 (closed defect: fixed)

Opened 3 years ago

Last modified 2 years ago

Documents get mixed between SOLR cores

Reported by: amercader
Owned by:
Priority: major
Milestone: ckan-sprint-2011-11-07
Component: ckan
Keywords: search
Cc: johnglover, thejimmyg, kindly, rgrp, nils.toedtmann@…
Repository: ckan
Theme: none

Description

Occasionally, and apparently at random, a SOLR core ends up containing documents that carry the site_id of a different site.

E.g.: we look for all documents in the testing.iatiregistry.org core, faceted by site_id. We would expect all documents to have site_id = iati_testing, but some of them have site_id = iatiregistry.org:

http://okfn-solr.fry-it.com:8080/solr/testing.iatiregistry.org/select?indent=on&version=2.2&q=*:*&fq=+state:active&facet=true&facet.field=site_id

<lst name="facet_fields">
<lst name="site_id">
<int name="iati_testing">265</int>
<int name="iatiregistry.org">255</int>
</lst>
</lst>
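A check like the one above can be scripted. The following is only a sketch: it parses the facet fragment quoted in this ticket and lists any site_id that should not appear in the core. The function names are made up for illustration and are not part of CKAN.

```python
import xml.etree.ElementTree as ET

# Facet fragment copied from the response quoted in this ticket.
FACET_XML = """
<lst name="facet_fields">
<lst name="site_id">
<int name="iati_testing">265</int>
<int name="iatiregistry.org">255</int>
</lst>
</lst>
"""


def site_id_counts(facet_xml):
    """Return {site_id: document count} from a facet_fields fragment."""
    root = ET.fromstring(facet_xml.strip())
    counts = {}
    for field in root.findall("lst"):
        if field.get("name") == "site_id":
            for entry in field.findall("int"):
                counts[entry.get("name")] = int(entry.text)
    return counts


def foreign_site_ids(counts, expected_site_id):
    """Return the facet entries that do not belong to this core's site."""
    return {site: n for site, n in counts.items()
            if site != expected_site_id and n > 0}


counts = site_id_counts(FACET_XML)
foreign = foreign_site_ids(counts, "iati_testing")
print(foreign)  # {'iatiregistry.org': 255}
```

An empty result would mean that every document in the core carries the expected site_id.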

If we take one of the records that disappeared from the "iati_testing" site_id and compare it in the production and testing SOLR cores of the server, the two records are exactly the same, including the indexed_ts property:

http://okfn-solr.fry-it.com:8080/solr/iatiregistry.org/select?indent=on&version=2.2&q=id:97d1ab0d-b203-4757-8f4e-a0c84d2f759f&facet=true&facet.field=site_id

http://okfn-solr.fry-it.com:8080/solr/testing.iatiregistry.org/select?indent=on&version=2.2&q=id:97d1ab0d-b203-4757-8f4e-a0c84d2f759f&facet=true&facet.field=site_id

Note that the response from the URLs shown may vary, as the testing site could have been reindexed.

Attachments

jetty.requests.log (210 bytes) - added by amercader 2 years ago.
Requests received after editing a dataset
jetty.stderrout.log (6.5 KB) - added by amercader 2 years ago.
SOLR output after editing a dataset

Change History

Changed 2 years ago by amercader

Requests received after editing a dataset

Changed 2 years ago by amercader

SOLR output after editing a dataset

comment:1 Changed 2 years ago by amercader

  • Cc rgrp added

I've been digging more into this one. To reproduce it, you just have to edit the same dataset on both sites (production and testing). Right after the dataset is edited, the search index ends up with mixed site_ids.

I checked the jetty logs (see attached files): just after editing a dataset there are two POST requests to update the index. The request logs don't show the request params, so it's hard to tell what the second call does (it is probably the commit): https://bitbucket.org/okfn/ckan/src/97e1e90d66d7/ckan/lib/search/index.py#cl-144

In any case, the problem may be related to the datasets in the two cores sharing the same id. We currently use the dataset id as uniqueKey in SOLR, i.e. in our schema.xml we define:

<uniqueKey>id</uniqueKey>

According to the SOLR docs: "If a document is added that contains the same value for this field as an existing document, the old document will be deleted."

http://wiki.apache.org/solr/SchemaXml#The_Unique_Key_Field

I would expect the uniqueKey to be scoped to each core, but that does not seem to be the case here. Maybe we should generate a solr_id specific to each document for each site, as described here: http://wiki.apache.org/solr/UniqueKey#UUID_techniques

(Note that apart from the testing/production site use case, at some point sites involved in harvesting could also end up with datasets with the same id.)

Again, I'm not a SOLR expert, so the problem could be a completely different one!

comment:2 Changed 2 years ago by nils.toedtmann

  • Cc nils.toedtmann@… added

comment:3 Changed 2 years ago by amercader

Right, more news on this front.

I've tested a patch which uses a hash of the dataset id and site_id to produce a unique id, and then configured the iati solr cores to use index_id as uniqueKey: https://bitbucket.org/okfn/ckan/changeset/855f5a452f60

Unfortunately, that did not solve the issue. Again, updating the same dataset in both apps messes things up. In this case, documents don't get replaced, but duplicated, so the new uniqueKey is working.
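The composite key idea can be sketched as follows. This is not the code from the changeset: MD5, plain string concatenation and the function name make_index_id are assumptions made for illustration. The toy "index" is just a dict keyed by uniqueKey, which mimics the documented rule that adding a document with an existing key replaces the old one.

```python
import hashlib


def make_index_id(dataset_id, site_id):
    """Derive a per-site key from the dataset id and the site id.

    MD5 and plain concatenation are illustrative assumptions.
    """
    return hashlib.md5((dataset_id + site_id).encode("utf-8")).hexdigest()


dataset_id = "97d1ab0d-b203-4757-8f4e-a0c84d2f759f"

# Toy model of one shared index under each choice of uniqueKey.
index_by_id = {}        # uniqueKey = id
index_by_index_id = {}  # uniqueKey = index_id

for site_id in ("iatiregistry.org", "iati_testing"):
    doc = {"id": dataset_id, "site_id": site_id}
    index_by_id[doc["id"]] = doc
    index_by_index_id[make_index_id(dataset_id, site_id)] = doc

print(len(index_by_id))        # 1: the second write replaced the first
print(len(index_by_index_id))  # 2: one document per site
```

This matches what was observed: with the new key, documents are duplicated rather than replaced, which suggests that both sites are writing into the same underlying index.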

I am more inclined to think that this is caused by a misconfiguration in the SOLR instance in s046. This is the file where the two cores are configured:

<solr persistent="true" sharedLib="lib">
 <cores adminPath="/admin/cores">
  <core name="testing.iatiregistry.org" instanceDir="testing.iatiregistry.org">
    <property name="dataDir" value="/usr/share/solr/testing.iatiregistry.org/data" />
  </core>
  <core name="iatiregistry.org" instanceDir="iatiregistry.org">
    <property name="dataDir" value="/usr/share/solr/iatiregistry.org/data" />
  </core>
 </cores>
</solr>

Following these paths: /usr/share/solr/iatiregistry.org symlinks to /etc/solr/iatiregistry.org, and /etc/solr/iatiregistry.org/data is empty (as is the testing equivalent). On the other hand, judging by the admin interface and some errors that I got, the data folder that both cores are actually using seems to be /var/lib/solr/data/index

Maybe that's the problem?

comment:4 Changed 2 years ago by pudo

You probably know this by now, but the solrconfig.xml for both cores points to /var/lib/solr/data (line 71).

comment:5 Changed 2 years ago by amercader

No, I didn't know it. Is this supposed to be the correct setting or should each core have its own data dir?

comment:6 Changed 2 years ago by rgrp

@pudo: brilliant spot. How did that misconfig end up there? Anyway, commenting it out and rebooting means we now have separate cores, which seems to fix it.

comment:7 Changed 2 years ago by amercader

  • Status changed from new to closed
  • Resolution set to fixed

As rgrp mentioned, we commented out the <dataDir> directive in the solrconfig.xml files and rebooted. That made the cores use the data dirs they were supposed to (the ones in solr.xml), and from the tests I ran it looks like this finally fixed the issue.
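To catch this kind of misconfiguration earlier, a check along these lines could be run against the config files. It is a sketch only. It assumes, as this thread concludes, that a hard-coded <dataDir> in solrconfig.xml wins over the dataDir property in solr.xml. The two solrconfig strings are hypothetical stand-ins, since the real files are not reproduced in this ticket, and the function names are made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# solr.xml as quoted in comment:3.
SOLR_XML = """<solr persistent="true" sharedLib="lib">
 <cores adminPath="/admin/cores">
  <core name="testing.iatiregistry.org" instanceDir="testing.iatiregistry.org">
    <property name="dataDir" value="/usr/share/solr/testing.iatiregistry.org/data" />
  </core>
  <core name="iatiregistry.org" instanceDir="iatiregistry.org">
    <property name="dataDir" value="/usr/share/solr/iatiregistry.org/data" />
  </core>
 </cores>
</solr>"""

# Hypothetical, heavily reduced solrconfig.xml contents.
HARD_CODED = "<config><dataDir>/var/lib/solr/data</dataDir></config>"
COMMENTED_OUT = "<config><!-- <dataDir>/var/lib/solr/data</dataDir> --></config>"


def effective_data_dirs(solr_xml, solrconfigs):
    """Map each core name to the data dir it would use (assumed precedence)."""
    dirs = {}
    for core in ET.fromstring(solr_xml).iter("core"):
        name = core.get("name")
        data_dir = None
        for prop in core.findall("property"):
            if prop.get("name") == "dataDir":
                data_dir = prop.get("value")
        # Assumption: an active <dataDir> in solrconfig.xml overrides solr.xml.
        override = ET.fromstring(solrconfigs[name]).findtext("dataDir")
        if override and override.strip():
            data_dir = override.strip()
        dirs[name] = data_dir
    return dirs


def shared_data_dirs(dirs):
    """Return {data_dir: [core names]} for dirs used by more than one core."""
    by_dir = defaultdict(list)
    for name, data_dir in dirs.items():
        by_dir[data_dir].append(name)
    return {d: sorted(names) for d, names in by_dir.items() if len(names) > 1}


cores = ("testing.iatiregistry.org", "iatiregistry.org")
before = shared_data_dirs(
    effective_data_dirs(SOLR_XML, {c: HARD_CODED for c in cores}))
after = shared_data_dirs(
    effective_data_dirs(SOLR_XML, {c: COMMENTED_OUT for c in cores}))
print(before)  # both cores share /var/lib/solr/data
print(after)   # {} once the directive is commented out
```

Any non-empty result means that two or more cores would write to the same index directory.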

comment:8 Changed 2 years ago by dread

Fixed in CKAN 1.5.1. Affects all previous CKANs that use SOLR in these circumstances.
