Ticket #318 (new defect) — at Version 2

Opened 4 years ago

Last modified 19 months ago

Insufficient validation of resource URIs

Reported by: wwaites
Owned by: dread
Priority: major
Milestone: ckan-sprint-2011-10-28
Component: ckan
Keywords:
Cc:
Repository: ckan
Theme: none

Description (last modified by rgrp)

The CKAN instance on data.gov.uk serves invalid URIs out of its API.

For example, the following can be found:

http://uk.sitestat.com/homeoffice/rds/s?rds.hosb0509tabsxls&ns_type=pdf&ns_url=[http://www.homeoffice.gov.uk/rds/pdfs09/hosb0509tabs.xls]

In this URI, the : and / characters after the ? in the query part are invalid according to section 3.4 of RFC 2396.

Also, trailing whitespace is not stripped from URIs.

This causes problems when software that validates URIs more strictly attempts to consume data from CKAN. In this instance, the Talis triplestore complains about such URIs.
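
As a rough illustration of the kind of check that could catch these cases before the data is served, here is a minimal Python sketch. The function name `find_uri_problems` is hypothetical and not part of CKAN. It only covers surrounding whitespace, malformed percent-encoding, and characters that may not appear anywhere in a URI under RFC 3986 (an assumed reference point); it is not a full grammar validator, so stricter per-component rules would have to be added on top.

```python
import re

# Characters that may appear somewhere in a URI under RFC 3986
# (unreserved, reserved, and "%" for percent-encoding).
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]")
# A "%" that is not followed by two hexadecimal digits.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def find_uri_problems(uri):
    """Return a list of problems found in `uri` (empty if none were found)."""
    problems = []
    stripped = uri.strip()
    if uri != stripped:
        problems.append("leading or trailing whitespace")
    if _BAD_PERCENT.search(stripped):
        problems.append("percent sign without two following hex digits")
    # Collect each distinct character that can never occur in a URI.
    bad_chars = sorted({c for c in stripped if not _ALLOWED.match(c)})
    if bad_chars:
        problems.append(
            "disallowed characters: " + " ".join(repr(c) for c in bad_chars)
        )
    return problems
```

An empty list means none of these particular checks failed, not that the URI is valid in every respect.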

"Be liberal in what you accept and conservative in what you send" would seem apt.

Actions

  • Validation of URLs as part of form entry or data loading
    • We need to consider the situation where this should happen out-of-band (i.e. we allow loading even with invalid data and then flag bad dates in a separate validation process). In general, it is doubtful that we should do this here, because URL invalidity is such a big deal
  • This code should support analysis of existing data, so we can go through the existing database and find invalid URLs
    • It is also useful to have this so we can do out-of-band validation
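
The audit of existing data described in the actions above could be sketched roughly as follows. This is an assumption-laden illustration, not CKAN code: `audit_urls` and `UrlFinding` are hypothetical names, the input is taken to be plain `(resource_id, url)` pairs pulled from wherever the resources are stored, and only two simple checks are applied.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple

# A "%" that is not followed by two hexadecimal digits.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


class UrlFinding(NamedTuple):
    """One problem found for one stored URL (hypothetical report row)."""
    resource_id: str
    url: str
    problem: str


def audit_urls(records: Iterable[Tuple[str, str]]) -> Iterator[UrlFinding]:
    """Yield a finding for every problem detected in the given records."""
    for resource_id, url in records:
        if url != url.strip():
            yield UrlFinding(resource_id, url, "surrounding whitespace")
        if any(ch.isspace() for ch in url.strip()):
            yield UrlFinding(resource_id, url, "embedded whitespace")
        if _BAD_PERCENT.search(url):
            yield UrlFinding(resource_id, url, "illegal percent-encoding")
```

Because the function only reads records and yields findings, the same code could run as a one-off pass over the database or as a periodic out-of-band job, leaving the decision about fixing or rejecting each URL to the caller.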

Change History

comment:1 Changed 4 years ago by wwaites

Some more datapoints from Leigh Dodds of Talis:

I'm still having no joy with this, I'm afraid. I'm test parsing the data locally using the TDB command-line tools, specifically tdbcheck, which will parse the data and generate warnings/exceptions. This uses the same parsing code, data and URI validation code as we're using on the Platform.

Currently it's giving me warnings for invalid lexical values for dates, e.g.:

Lexical not valid for datatype: "2008"^^http://www.w3.org/2001/XMLSchema#date

While these aren't a major issue, looking at some of the data suggests that there are more underlying data problems that need checking and fixing up, e.g.:

Lexical not valid for datatype: "n/a"^^http://www.w3.org/2001/XMLSchema#date
Lexical not valid for datatype: "27/04/2006 13:56"^^http://www.w3.org/2001/XMLSchema#date
Lexical not valid for datatype: "Real time calculation"^^http://www.w3.org/2001/XMLSchema#date
Lexical not valid for datatype: "varies by country"^^http://www.w3.org/2001/XMLSchema#date

And there are still some invalid URIs, e.g.:

<https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&ref=3.02.16> Code: 30/ILLEGAL_PERCENT_ENCODING in QUERY: The host component a percent occurred without two following hexadecimal digits.

Can I suggest you try running the converted data through tdbcheck to iron out any problems? Then I can push it into the Platform.
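
The date warnings quoted above could in principle be caught before export with a lexical check along these lines. This is a simplified sketch under stated assumptions: `is_xsd_date` is a hypothetical helper, it accepts only four-digit positive years in the `YYYY-MM-DD` form with an optional `Z` or `+hh:mm`/`-hh:mm` timezone, and it does not range-check the timezone offset.

```python
import re
from datetime import date

# Simplified xsd:date lexical form: YYYY-MM-DD plus an optional timezone.
_XSD_DATE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})(Z|[+-][0-9]{2}:[0-9]{2})?"
)


def is_xsd_date(value):
    """Return True if `value` looks like a valid xsd:date lexical form."""
    match = _XSD_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(match.group(i)) for i in (1, 2, 3))
    try:
        # Rejects impossible calendar dates such as 2006-02-30.
        date(year, month, day)
    except ValueError:
        return False
    return True
```

Values such as "2008" or "n/a" fail this check, so they could be emitted as plain literals or flagged for cleanup rather than typed as dates.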

comment:2 Changed 4 years ago by rgrp

  • Owner changed from rgrp to dread
  • Priority changed from major to blocker
  • Description modified
  • Milestone set to v1.1