<?xml version="1.0"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>CKAN: Ticket #318: Insufficient validation of resource URIs</title>
    <link>http://localhost/ticket/318</link>
    <description>&lt;p&gt;
The CKAN instance on data.gov.uk serves invalid URIs out of its API.
&lt;/p&gt;
&lt;p&gt;
For example the following can be found,
&lt;/p&gt;
&lt;p&gt;
&lt;a class="ext-link" href="http://uk.sitestat.com/homeoffice/rds/s?rds.hosb0509tabsxls&amp;amp;ns_type=pdf&amp;amp;ns_url=[http://www.homeoffice.gov.uk/rds/pdfs09/hosb0509tabs.xls"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://uk.sitestat.com/homeoffice/rds/s?rds.hosb0509tabsxls&amp;amp;ns_type=pdf&amp;amp;ns_url=[http://www.homeoffice.gov.uk/rds/pdfs09/hosb0509tabs.xls&lt;/a&gt;]
&lt;/p&gt;
&lt;p&gt;
In this URI, the : and / characters after the ? in the query part are invalid according to section 3.4 of RFC2396
&lt;/p&gt;
&lt;p&gt;
Also URIs are not stripped of whitespace at the end.
&lt;/p&gt;
&lt;p&gt;
This causes problems when other software with a more correct interpretation of what a valid URI is attempts to consume data from CKAN. In this instance the Talis triplestore complains about such URIs.
&lt;/p&gt;
&lt;p&gt;
"Be liberal in what you accept and conservative in what you send" would seem apt.
&lt;/p&gt;
&lt;h2 id="Actions"&gt;Actions&lt;/h2&gt;
&lt;ul&gt;&lt;li&gt;Validation of urls as part of form entry or data loading
&lt;ul&gt;&lt;li&gt;Need to consider situation where this should happen out-of-band (i.e. we allow load even with invalid data and then flag bad dates in separate validation process). In general doubtful that we should do this here because url invalidity is such a big deal
&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;This code should support analysis of existing data so we can go through existing database and find invalid urls
&lt;ul&gt;&lt;li&gt;Also useful to have this so we can do out of band validation
&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;</description>
    <language>en-us</language>
    <image>
      <title>CKAN</title>
      <url>http://assets.okfn.org/p/ckan/img/ckan_logo_shortname.png</url>
      <link>http://localhost/ticket/318</link>
    </image>
    <generator>Trac 0.12.3</generator>
    <item>
      
        <dc:creator>wwaites</dc:creator>

      <pubDate>Thu, 20 May 2010 17:43:05 GMT</pubDate>
      <title></title>
      <link>http://localhost/ticket/318#comment:1</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:1</guid>
      <description>
        &lt;p&gt;
Some more datapoints from Leigh Dodds of Talis:
&lt;/p&gt;
&lt;p&gt;
I'm still having no joy with this I'm afraid. I'm test parsing the
data locally using the TDB command-line tools, specifically tdbcheck
which will parse the data and generate warnings/exceptions. This uses
the same parsing code, data and URI validation code as we're using on
the Platform.
&lt;/p&gt;
&lt;p&gt;
Currently its giving me warnings for invalid lexical values for dates, e.g:
&lt;/p&gt;
&lt;p&gt;
Lexical not valid for datatype: "2008"&lt;sup&gt;&lt;/sup&gt;&lt;a class="ext-link" href="http://www.w3.org/2001/XMLSchema#date"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.w3.org/2001/XMLSchema#date&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
While these aren't a major issue, looking at some of the data suggests
that there are more underlying data problems that need checking and
fixing up, e.g:
&lt;/p&gt;
&lt;p&gt;
Lexical not valid for datatype: "n/a"&lt;sup&gt;&lt;/sup&gt;&lt;a class="ext-link" href="http://www.w3.org/2001/XMLSchema#date"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.w3.org/2001/XMLSchema#date&lt;/a&gt;
Lexical not valid for datatype: "27/04/2006
13:56"&lt;sup&gt;&lt;/sup&gt;&lt;a class="ext-link" href="http://www.w3.org/2001/XMLSchema#date"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.w3.org/2001/XMLSchema#date&lt;/a&gt;
Lexical not valid for datatype: "Real time
calculation"&lt;sup&gt;&lt;/sup&gt;&lt;a class="ext-link" href="http://www.w3.org/2001/XMLSchema#date"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.w3.org/2001/XMLSchema#date&lt;/a&gt;
Lexical not valid for datatype: "varies by
country"&lt;sup&gt;&lt;/sup&gt;&lt;a class="ext-link" href="http://www.w3.org/2001/XMLSchema#date"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.w3.org/2001/XMLSchema#date&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
And there are still some invalid URIs, e.g:
&lt;/p&gt;
&lt;p&gt;
&amp;lt;&lt;a class="ext-link" href="https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&amp;amp;ref=3.02.16"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&amp;amp;ref=3.02.16&lt;/a&gt;&amp;gt;
Code: 30/ILLEGAL_PERCENT_ENCODING in QUERY: The host component a
percent occurred without two following hexadecimal digits.
&lt;/p&gt;
&lt;p&gt;
Can I suggest you try running the converted data through tdbcheck to
iron out any problems? Then I can push it into the Platform.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Fri, 28 May 2010 18:55:40 GMT</pubDate>
      <title>owner, priority, description changed; milestone set</title>
      <link>http://localhost/ticket/318#comment:2</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:2</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;rgrp&lt;/em&gt; to &lt;em&gt;dread&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;priority&lt;/strong&gt;
                changed from &lt;em&gt;major&lt;/em&gt; to &lt;em&gt;blocker&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;description&lt;/strong&gt;
              modified (&lt;a href="/ticket/318?action=diff&amp;amp;version=2"&gt;diff&lt;/a&gt;)
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                set to &lt;em&gt;v1.1&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>dread</dc:creator>

      <pubDate>Mon, 31 May 2010 15:44:37 GMT</pubDate>
      <title>owner changed</title>
      <link>http://localhost/ticket/318#comment:3</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:3</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;dread&lt;/em&gt; to &lt;em&gt;rgrp&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
We can't change any of the metadata without permission from the various departments who supplied it.
&lt;/p&gt;
&lt;p&gt;
I think "Don't shoot the messenger" is apt here.
&lt;/p&gt;
&lt;p&gt;
Adding this to the form validation isn't going to change any of the existing data. I think this is better off in the data quality scoring.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>wwaites</dc:creator>

      <pubDate>Fri, 11 Jun 2010 15:49:03 GMT</pubDate>
      <title></title>
      <link>http://localhost/ticket/318#comment:4</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:4</guid>
      <description>
        &lt;p&gt;
url validation reputed to be here:
&lt;/p&gt;
&lt;p&gt;
&lt;a class="ext-link" href="http://www.livinglogic.de/Python/url/Howto.html"&gt;&lt;span class="icon"&gt;​&lt;/span&gt;http://www.livinglogic.de/Python/url/Howto.html&lt;/a&gt;
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>wwaites</dc:creator>

      <pubDate>Sun, 13 Jun 2010 14:19:53 GMT</pubDate>
      <title></title>
      <link>http://localhost/ticket/318#comment:5</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:5</guid>
      <description>
        &lt;p&gt;
Some good news, ll.url seems to take bad urls and make them into good urls.
&lt;/p&gt;
&lt;p&gt;
viz:
&lt;/p&gt;
&lt;pre class="wiki"&gt;In [1]: from ll import url
In [2]: print url.URL("https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&amp;amp;ref=3.02.16")
------&amp;gt; print(url.URL("https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%3&amp;amp;ref=3.02.16"))
/Users/ww/Work/OKF/ckanrdf/lib/python2.6/site-packages/ll/url.py:2358: UserWarning: truncated escape at position 4
  value = _unescape(namevalue[1].replace("+", " "))
https://mqi.ic.nhs.uk/IndicatorDataView.aspx?query=NRLS%253&amp;amp;ref=3%2E02%2E16
&lt;/pre&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>wwaites</dc:creator>

      <pubDate>Sun, 13 Jun 2010 14:20:32 GMT</pubDate>
      <title></title>
      <link>http://localhost/ticket/318#comment:6</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:6</guid>
      <description>
        &lt;p&gt;
Also fyi, getting ll.url is done like so
&lt;/p&gt;
&lt;p&gt;
{{
pip install ll-xist
}}}
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>wwaites</dc:creator>

      <pubDate>Sun, 13 Jun 2010 14:21:47 GMT</pubDate>
      <title></title>
      <link>http://localhost/ticket/318#comment:7</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:7</guid>
      <description>
        &lt;p&gt;
I've updated ckanrdf to strip out datatypes and use this ll.url on external references so that should be sufficient to hold off talis.
&lt;/p&gt;
&lt;p&gt;
Still need to work particularly on validating dates though...
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Mon, 14 Jun 2010 13:11:45 GMT</pubDate>
      <title>owner changed</title>
      <link>http://localhost/ticket/318#comment:8</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:8</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;rgrp&lt;/em&gt; to &lt;em&gt;dread&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Mon, 02 Aug 2010 08:24:47 GMT</pubDate>
      <title>owner changed; milestone deleted</title>
      <link>http://localhost/ticket/318#comment:9</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:9</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;dread&lt;/em&gt; to &lt;em&gt;rgrp&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                &lt;em&gt;v1.1&lt;/em&gt; deleted
            &lt;/li&gt;
          &lt;/ul&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Mon, 02 Aug 2010 08:27:00 GMT</pubDate>
      <title>owner changed</title>
      <link>http://localhost/ticket/318#comment:10</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:10</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;rgrp&lt;/em&gt; to &lt;em&gt;pudo&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Important but low priority according to CO so bumping into next milestone (v1.2). NB: did not seem able to update milestone in trac interface! (Perhaps due to agilo stuff?)
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>wwaites</dc:creator>

      <pubDate>Mon, 30 Aug 2010 14:49:28 GMT</pubDate>
      <title></title>
      <link>http://localhost/ticket/318#comment:11</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:11</guid>
      <description>
        &lt;p&gt;
CO may not realise the implications when they said it was low priority. The implication of this lack of validation is that it is impossible to generate valid URIs in the RDF which means it cannot be imported by Talis. So until there is a solution to this, no RDF catalog.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>rgrp</dc:creator>

      <pubDate>Sat, 29 Jan 2011 22:39:28 GMT</pubDate>
      <title>priority changed</title>
      <link>http://localhost/ticket/318#comment:12</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:12</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;priority&lt;/strong&gt;
                changed from &lt;em&gt;blocker&lt;/em&gt; to &lt;em&gt;awaiting triage&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Still not sure what the priority is so moving to awaiting triage.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>pudo</dc:creator>

      <pubDate>Mon, 31 Jan 2011 09:48:28 GMT</pubDate>
      <title>status changed; resolution set</title>
      <link>http://localhost/ticket/318#comment:13</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:13</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;new&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;wontfix&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
This will be implicit in &lt;a class="closed ticket" href="http://localhost/ticket/852" title="enhancement: [super] Dataset upload and archiving (closed: fixed)"&gt;#852&lt;/a&gt;, thus not building something specific for it now.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>wwaites</dc:creator>

      <pubDate>Mon, 31 Jan 2011 13:54:09 GMT</pubDate>
      <title>priority, status changed; resolution deleted</title>
      <link>http://localhost/ticket/318#comment:14</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:14</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;priority&lt;/strong&gt;
                changed from &lt;em&gt;awaiting triage&lt;/em&gt; to &lt;em&gt;major&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;closed&lt;/em&gt; to &lt;em&gt;reopened&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                &lt;em&gt;wontfix&lt;/em&gt; deleted
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
We still require form validation to check URIs. They are not free-form strings. This is not the same as 852 or necessarily included in it.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>thejimmyg</dc:creator>

      <pubDate>Wed, 20 Jul 2011 15:41:37 GMT</pubDate>
      <title>owner, status changed; repo, theme, milestone set</title>
      <link>http://localhost/ticket/318#comment:15</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:15</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt;
              changed from &lt;em&gt;pudo&lt;/em&gt; to &lt;em&gt;johnglover&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;repo&lt;/strong&gt;
                set to &lt;em&gt;ckan&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;theme&lt;/strong&gt;
                set to &lt;em&gt;none&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;reopened&lt;/em&gt; to &lt;em&gt;assigned&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;milestone&lt;/strong&gt;
                set to &lt;em&gt;ckan-current-sprint&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
Assigning to John so that he can see whether the QA code correctly flags these kinds of problems. If it does, we can close this ticket because although the API will serve invalid URLs, the publishers will be notified to clean up.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>johnglover</dc:creator>

      <pubDate>Wed, 27 Jul 2011 12:44:43 GMT</pubDate>
      <title>status changed; resolution set</title>
      <link>http://localhost/ticket/318#comment:16</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:16</guid>
      <description>
          &lt;ul&gt;
            &lt;li&gt;&lt;strong&gt;status&lt;/strong&gt;
                changed from &lt;em&gt;assigned&lt;/em&gt; to &lt;em&gt;closed&lt;/em&gt;
            &lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;resolution&lt;/strong&gt;
                set to &lt;em&gt;fixed&lt;/em&gt;
            &lt;/li&gt;
          &lt;/ul&gt;
        &lt;p&gt;
The QA code should identify invalid URLs. Resources with invalid urls will have an 'openness_score' of 0 and an 'openness_score_reason' of 'Invalid url scheme' or 'invalid URL'.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item><item>
      
        <dc:creator>dread</dc:creator>

      <pubDate>Tue, 09 Oct 2012 10:31:02 GMT</pubDate>
      <title></title>
      <link>http://localhost/ticket/318#comment:17</link>
      <guid isPermaLink="false">http://localhost/ticket/318#comment:17</guid>
      <description>
        &lt;p&gt;
Here's a real example - one of many from MOD
&lt;tt&gt;http://www.dasa.mod.uk/applications/newWeb/www/index.php?page=48&amp;amp;thiscontent=180&amp;amp;date=2011-05-26&amp;amp;pubType=1&amp;amp;PublishTime=09:30:00&amp;amp;from=home&amp;amp;tabOption=1&lt;/tt&gt;
Browsers accept colons and slashes happily, which is the main usage of our links. The URL looks better with the colons and slashes, rather than the encoded version. The average departmental user doesn't understand that the reason to encode them is for some academic RFC and RDF which is not "liberal in what it accepts". Since the RDF tool has a satisfactory way to encode links, this problem is essentially solved. Therefore I'm changing ckanext-archiver to accept these unencoded links, I'm afraid.
&lt;/p&gt;
      </description>
      <category>Ticket</category>
    </item>
 </channel>
</rss>