Ticket #2331 (reopened defect)

Opened 2 years ago

Last modified 16 months ago

Search should AND terms not OR terms

Reported by: rgrp
Owned by: kindly
Priority: major
Milestone: ckan-sprint-2012-05-29
Component: ckan
Keywords:
Cc:
Repository: ckan
Theme: none

Description

It appears that the current default search in CKAN ORs terms rather than ANDing them (i.e. adding more terms increases the number of items found rather than reducing it).

Not sure when this crept in or if it has been there for a long time.

Change History

comment:1 Changed 2 years ago by rgrp

  • Owner set to rossjones
  • Status changed from new to assigned
  • Milestone changed from ckan-v1.7 to current-ckan-sprint-2012-05-29

comment:2 Changed 2 years ago by kindly

This is a wontfix. I think terms should be ORed by default. All modern search engines work like this. If there is an issue due to relevancy (i.e. if you type multiple words and your result is not coming near the top) then we should use these examples so we can tweak the results.

comment:3 Changed 2 years ago by ross

  • Owner changed from rossjones to kindly

comment:4 Changed 2 years ago by kindly

  • Status changed from assigned to closed
  • Resolution set to wontfix

comment:5 Changed 2 years ago by rgrp

  • Status changed from closed to reopened
  • Resolution wontfix deleted

@kindly I really don't get this, but I may understand poorly. Google does not work by OR-ing, does it? (As I add things to the search query I get *fewer* results, not more ...)

It also makes it really hard to find stuff: as I add terms to my query I get more results, not fewer, which makes finding anything very difficult. I appreciate that there is a scoring aspect, but that seems secondary.

comment:6 Changed 2 years ago by kindly

  • Status changed from reopened to closed
  • Resolution set to wontfix

Scoring is primary in my opinion. Who cares if you have 1000 results if the top few are yours?

If things are hard to find we need to change our relevancy first. So if you have examples of where you think the scoring is wrong then please make a ticket for that.

Google, I imagine, just has a cutoff of anything under a certain score not being shown. We could do that as well but it would take some working out what we wanted that score to be.

Full "And" queries also limit any accidental discovery, especially of rare terms. If you do not get your search exactly correct then you get nothing, which is bad.

Obviously you can still AND things or +things.

The correct solution to this is adding a minimum match parameter, which is a middle ground. E.g. you can say that you want to match over half of the terms: "thing1 thing2 thing3" means you have to match at least 2. There are many options described here: http://wiki.apache.org/solr/DisMaxQParserPlugin. You can change this in an extension if you want; it just requires adding an mm field in the before_search of the IPackageController. I do not personally think we should change it as the default.
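
A minimal sketch of the extension approach described above, assuming CKAN hands before_search a dict of search parameters and passes an mm key through to Solr's dismax parser. The class name and the "100%" value are illustrative and not taken from this ticket; a real plugin would also subclass ckan.plugins.SingletonPlugin and implement IPackageController.

```python
# Hypothetical sketch: the class name and the "100%" value are assumptions.
# In a real CKAN extension this class would subclass
# ckan.plugins.SingletonPlugin and declare
# ckan.plugins.implements(ckan.plugins.IPackageController, inherit=True).


class RequireAllTermsPlugin(object):
    """Sets a dismax minimum-match (mm) value on dataset searches."""

    # "100%" asks Solr to require every optional clause (AND-like
    # behaviour). A lower percentage sits between OR and AND.
    MIN_MATCH = "100%"

    def before_search(self, search_params):
        # Keep any mm value the caller already supplied.
        search_params.setdefault("mm", self.MIN_MATCH)
        return search_params


params = RequireAllTermsPlugin().before_search({"q": "thatcher wages"})
print(params)
```

Using setdefault rather than plain assignment leaves room for a caller to pick a looser match on a per-query basis.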

I am closing it as wontfix as it's trivial to change and is a philosophical difference, not a technical one.

comment:7 Changed 21 months ago by rgrp

  • Status changed from closed to reopened
  • Resolution wontfix deleted

I'm re-opening this. This is a reasonably significant UX problem -- not just a philosophical difference. It is just *weird* to get *more* results as I add search terms rather than less. Please can we have this fixed ...

comment:8 Changed 21 months ago by kindly

  • Status changed from reopened to closed
  • Resolution set to wontfix

As I said, I do not agree with it being a UX problem or *weird*. *weird* is not acceptable in response to a thought-out comment. So I am closing it again.

comment:9 Changed 21 months ago by rgrp

  • Status changed from closed to reopened
  • Resolution wontfix deleted

Re-opening yet again. This is something brought up by the entire client team on several occasions :-)

We really do need to fix this as it is really poor UX. To give yet another example:

http://datahub.io/dataset?q=thatcher+wages

This returns 15 datasets, yet only one of them (the first one!) has any mention of thatcher. The others are there because they match wages. This isn't what I expect. I expect *only* to see datasets that match *both* terms ...
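
The difference between the two behaviours can be shown with a toy sketch. The three documents below are invented for illustration and are not the datahub.io results; min_match plays the role of a minimum-match setting, with 1 behaving like OR and the full term count behaving like AND.

```python
def matches(doc_words, terms, min_match):
    """Return True if at least min_match of the terms occur in the document."""
    hits = sum(1 for term in terms if term in doc_words)
    return hits >= min_match


# Invented toy documents, keyed by name, each a set of indexed words.
docs = {
    "a": {"thatcher", "wages", "1980s"},
    "b": {"wages", "survey"},
    "c": {"wages", "regional"},
}
terms = ["thatcher", "wages"]

# OR-style: one matching term is enough.
or_results = sorted(name for name, words in docs.items()
                    if matches(words, terms, 1))
# AND-style: every term must match.
and_results = sorted(name for name, words in docs.items()
                     if matches(words, terms, len(terms)))

print(or_results)   # ['a', 'b', 'c']
print(and_results)  # ['a']
```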

comment:10 Changed 21 months ago by ross

This is the dullest game of ticket tennis ever :D

Google don't OR the results; they have an explicit operator for ORing query terms and default to ensuring that all terms are present (in one form or another) in the documents. I understand the argument about cut-offs and relevancy scores, but a large search engine has the benefit of scale in both resources and documents. I spent 6 months of my life tweaking weightings for n-dimensional vector space search, and I don't think it is likely to be a number we can, or will, find by accident.

Given that Marcus also got confused by the current behaviour, I think we should either:

a) Make it configurable
b) Just change it to what has been asked for

Don't care which.

comment:11 Changed 16 months ago by rgrp

Now transferred to github at https://github.com/okfn/ckan/issues/258
