Ticket #318 (closed defect: fixed)

Opened 4 years ago

Last modified 19 months ago

Insufficient validation of resource URIs

Reported by: wwaites Owned by: johnglover
Priority: major Milestone: ckan-sprint-2011-10-28
Component: ckan Keywords:
Cc: Repository: ckan
Theme: none

Description (last modified by rgrp) (diff)

The CKAN instance on data.gov.uk serves invalid URIs out of its API.

For example the following can be found,

http://uk.sitestat.com/homeoffice/rds/s?rds.hosb0509tabsxls&ns_type=pdf&ns_url=[http://www.homeoffice.gov.uk/rds/pdfs09/hosb0509tabs.xls]

In this URI, the : and / characters after the ? in the query part are invalid according to section 3.4 of RFC2396

Also URIs are not stripped of whitespace at the end.

This causes problems when other software with a more correct interpretation of what a valid URI is attempts to consume data from CKAN. In this instance the Talis triplestore complains about such URIs.

"Be liberal in what you accept and conservative in what you send" would seem apt.

Actions

  • Validation of urls as part of form entry or data loading
    • Need to consider situation where this should happen out-of-band (i.e. we allow load even with invalid data and then flag bad dates in separate validation process). In general doubtful that we should do this here because url invalidity is such a big deal
  • This code should support analysis of existing data so we can go through existing database and find invalid urls
    • Also useful to have this so we can do out of band validation

Change History

comment:1 Changed 4 years ago by wwaites

Some more datapoints from Leigh Dodds of Talis:

I'm still having no joy with this I'm afraid. I'm test parsing the data locally using the TDB command-line tools, specifically tdbcheck which will parse the data and generate warnings/exceptions. This uses the same parsing code, data and URI validation code as we're using on the Platform.

Currently its giving me warnings for invalid lexical values for dates, e.g:

Lexical not valid for datatype: "2008"http://www.w3.org/2001/XMLSchema#date

While these aren't a major issue, looking at some of the data suggests that there are more underlying data problems that need checking and fixing up, e.g:

Lexical not valid for datatype: "n/a"http://www.w3.org/2001/XMLSchema#date Lexical not valid for datatype: "27/04/2006 13:56"http://www.w3.org/2001/XMLSchema#date Lexical not valid for datatype: "Real time calculation"http://www.w3.org/2001/XMLSchema#date Lexical not valid for datatype: "varies by country"http://www.w3.org/2001/XMLSchema#date

And there are still some invalid URIs, e.g:

<https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&ref=3.02.16> Code: 30/ILLEGAL_PERCENT_ENCODING in QUERY: The host component a percent occurred without two following hexadecimal digits.

Can I suggest you try running the converted data through tdbcheck to iron out any problems? Then I can push it into the Platform.

comment:2 Changed 4 years ago by rgrp

  • Owner changed from rgrp to dread
  • Priority changed from major to blocker
  • Description modified (diff)
  • Milestone set to v1.1

comment:3 Changed 4 years ago by dread

  • Owner changed from dread to rgrp

We can't change any of the metadata without permission from the various departments who supplied it.

I think "Don't shoot the messenger" is apt here.

Adding this to the form validation isn't going to change any of the existing data. I think this is better off in the data quality scoring.

comment:4 Changed 4 years ago by wwaites

url validation reputed to be here:

http://www.livinglogic.de/Python/url/Howto.html

comment:5 Changed 4 years ago by wwaites

Some good news, ll.url seems to take bad urls and make them into good urls.

viz:

In [1]: from ll import url
In [2]: print url.URL("https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&ref=3.02.16")
------> print(url.URL("https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&ref=3.02.16"))
/Users/ww/Work/OKF/ckanrdf/lib/python2.6/site-packages/ll/url.py:2358: UserWarning: truncated escape at position 4
  value = _unescape(namevalue[1].replace("+", " "))
https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%253&ref=3%2E02%2E16

comment:6 Changed 4 years ago by wwaites

Also fyi, getting ll.url is done like so

{{ pip install ll-xist }}}

comment:7 Changed 4 years ago by wwaites

I've updated ckanrdf to strip out datatypes and use this ll.url on external references so that should be sufficient to hold off talis.

Still need to work particularly on validating dates though...

comment:8 Changed 4 years ago by rgrp

  • Owner changed from rgrp to dread

comment:9 Changed 4 years ago by rgrp

  • Owner changed from dread to rgrp
  • Milestone v1.1 deleted

comment:10 Changed 4 years ago by rgrp

  • Owner changed from rgrp to pudo

Important but low priority according to CO so bumping into next milestone (v1.2). NB: did not seem able to update milestone in trac interface! (Perhaps due to agilo stuff?)

comment:11 Changed 4 years ago by wwaites

CO may not realise the implications when they said it was low priority. The implication of this lack of validation is that it is impossible to generate valid URIs in the RDF which means it cannot be imported by Talis. So until there is a solution to this, no RDF catalog.

comment:12 Changed 3 years ago by rgrp

  • Priority changed from blocker to awaiting triage

Still not sure what the priority is so moving to awaiting triage.

comment:13 Changed 3 years ago by pudo

  • Status changed from new to closed
  • Resolution set to wontfix

This will be implicit in #852, thus not building something specific for it now.

comment:14 Changed 3 years ago by wwaites

  • Priority changed from awaiting triage to major
  • Status changed from closed to reopened
  • Resolution wontfix deleted

We still require form validation to check URIs. They are not free-form strings. This is not the same as 852 or necessarily included in it.

comment:15 Changed 3 years ago by thejimmyg

  • Owner changed from pudo to johnglover
  • Repository set to ckan
  • Theme set to none
  • Status changed from reopened to assigned
  • Milestone set to ckan-current-sprint

Assigning to John so that he can see whether the QA code correctly flags these kinds of problems. If it does, we can close this ticket because although the API will serve invalid URLs, the publishers will be notified to clean up.

comment:16 Changed 3 years ago by johnglover

  • Status changed from assigned to closed
  • Resolution set to fixed

The QA code should identify invalid URLs. Resources with invalid urls will have an 'openness_score' of 0 and an 'openness_score_reason' of 'Invalid url scheme' or 'invalid URL'.

comment:17 Changed 19 months ago by dread

Here's a real example - one of many from MOD http://www.dasa.mod.uk/applications/newWeb/www/index.php?page=48&thiscontent=180&date=2011-05-26&pubType=1&PublishTime=09:30:00&from=home&tabOption=1 Browsers accept colons and slashes happily, which is the main usage of our links. The URL looks better with the colons and slashes, rather than the encoded version. The average departmental user doesn't understand that the reason to encode them is for some academic RFC and RDF which is not "liberal in what it accepts". Since the RDF tool has a satisfactory way to encode links, this problem is essentially solved. Therefore I'm changing ckanext-archiver to accept these unencoded links, I'm afraid.

Note: See TracTickets for help on using tickets.