Ticket #698: data-api-jsonp-proxy.txt

File data-api-jsonp-proxy.txt, 6.5 KB (added by rgrp, 4 years ago)

Proposal for JSON-P Proxy
=========================

Motivation
----------

At the moment there is no way for a client-side developer to build an
application which accesses the data on data.gov.uk without first setting up and
installing a server to fetch the data and send it to the browser.

.. note ::

    The reason for this is that the *same origin policy* prevents a browser
    fetching data from a different domain unless it is served in JSON-P format.
    Since the data associated with a package can be hosted anywhere, a
    browser-based application cannot fetch it directly.

My thinking is that:

* A JSON-P feed of the data would lower the barrier to entry for app builders:

  * Every web developer knows how to build services with JavaScript, regardless of their main language
  * Even non-developers should be able to hack together charts etc.

* The more people that try to consume the data, the more feedback data publishers will get about how to publish their data in a useful way
* By concentrating on the browser as a platform, more useful opportunities may emerge (e.g. package previews, SQLite distribution format etc.)


Proposal
--------

The proposal is not to implement everything in one go but to start with a common format and look at a way of building a useful feed. I propose starting with spreadsheet-like data for two reasons:

* This is most easily understood by the type of developer who would build a browser-side mashup or plot a graph
* There are already solutions for handling XML, RDF data etc., not least the SPARQL feeds

We therefore want an API for proxying spreadsheet data via JSON-P.

There will be a new API at ``/api/spreadsheet?callback=jsonpcallback&url=``

The URL to fetch will be URL-encoded and passed with the ``url`` parameter. On
the server the URL will be decoded and checked to see if it really is a URL
linked to via a package resource and that it is an ``.xls`` or ``.csv`` file.
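
The encoding and checking steps could look roughly like the following Python 3 sketch. It is only an illustration under stated assumptions: the host in ``PROXY_ENDPOINT``, the ``KNOWN_RESOURCE_URLS`` set and the helper names ``build_proxy_url`` and ``is_allowed`` are hypothetical, and a real server would look the resource URLs up in the package database rather than in a hard-coded set.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical host; the proposal only fixes the path /api/spreadsheet.
PROXY_ENDPOINT = "http://proxy.example.org/api/spreadsheet"

# Hypothetical stand-in for "URLs linked to via a package resource".
KNOWN_RESOURCE_URLS = {
    "http://example.org/files/spend-2010.xls",
    "http://example.org/files/schools.csv",
}


def build_proxy_url(resource_url, callback="jsonpcallback"):
    """Client side: URL-encode the resource URL into the ``url`` parameter."""
    query = urlencode({"callback": callback, "url": resource_url})
    return PROXY_ENDPOINT + "?" + query


def is_allowed(resource_url, known_urls=KNOWN_RESOURCE_URLS):
    """Server side: accept only known resources ending in .xls or .csv."""
    path = urlsplit(resource_url).path.lower()
    return resource_url in known_urls and path.endswith((".xls", ".csv"))


proxy_url = build_proxy_url("http://example.org/files/schools.csv")
# parse_qs decodes the query string again, as the server would.
decoded = parse_qs(urlsplit(proxy_url).query)["url"][0]
print(proxy_url)
print(is_allowed(decoded))
```

Checking the decoded URL against known package resources keeps the proxy from being used to fetch arbitrary URLs.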

The API will allow the following:

* Downloading the entire spreadsheet
* Downloading a single sheet (add ``sheet=1`` to the URL)
* Downloading a range in a single sheet (add ``range=A1:K3`` to the URL) [a bit nasty for CSV files but will do I think]
* Choosing a limited set of rows within the sheet (add ``row=5&row=7&row_range=10:100000:5000``; the ``row_range`` format is ``start:end:step``, so this example asks for a row every 5000 rows between rows 10 and 100000)
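
One possible way to interpret the ``row``, ``row_range`` and ``range`` parameters is sketched below in Python 3. The function names are hypothetical, and two details are assumptions the proposal does not settle: the end of a ``row_range`` is treated as inclusive, and a cell range is converted to zero-based indices.

```python
import re


def parse_row_range(spec):
    """Parse 'start:end:step' (e.g. '10:100000:5000') into three ints."""
    start, end, step = (int(part) for part in spec.split(":"))
    if step <= 0 or start > end:
        raise ValueError("bad row_range: %r" % spec)
    return start, end, step


def select_row_numbers(rows=(), row_range=None):
    """Combine repeated ``row`` values with an optional ``row_range``.

    Assumption: the end of the range is inclusive; duplicates are dropped.
    """
    wanted = set(int(r) for r in rows)
    if row_range is not None:
        start, end, step = parse_row_range(row_range)
        wanted.update(range(start, end + 1, step))
    return sorted(wanted)


def parse_cell_range(spec):
    """Parse an 'A1:K3' style range into zero-based, inclusive bounds.

    Returns (first_row, first_col, last_row, last_col).
    """
    match = re.match(r"^([A-Z]+)(\d+):([A-Z]+)(\d+)$", spec.upper())
    if match is None:
        raise ValueError("bad range: %r" % spec)

    def column_index(letters):
        # 'A' -> 0, 'Z' -> 25, 'AA' -> 26
        index = 0
        for letter in letters:
            index = index * 26 + (ord(letter) - ord("A") + 1)
        return index - 1

    col1, row1, col2, row2 = match.groups()
    return int(row1) - 1, column_index(col1), int(row2) - 1, column_index(col2)


print(select_row_numbers(rows=["5", "7"], row_range="10:30:10"))
print(parse_cell_range("A1:K3"))
```

For a CSV file the same cell range can be applied by treating the file as a single sheet whose columns are lettered from ``A``.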

The result will look like this with only the appropriate bits populated. For ``.xls`` files:

::

    {
        "url": "http://...file.xls",
        "option": "row=5&row=7&row_range=10:100000:5000",
        "name": ["Sheet 1", "Sheet 2"],
        "sheet": {
            "Sheet 1": [
                [...],
                [...],
                [...]
            ]
        }
    }

For ``.csv`` files:

::

    {
        "url": "http://...file.csv",
        "option": "row=5&row=7&row_range=10:100000:5000",
        "data": [
            [...],
            [...],
            [...]
        ]
    }
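
As a rough illustration of how the ``.csv`` result could be produced and wrapped for JSON-P, here is a minimal Python 3 sketch. The helper names ``csv_envelope`` and ``to_jsonp`` and the example URL are hypothetical, and restricting the callback to a plain identifier is an added assumption rather than part of the proposal.

```python
import csv
import io
import json


def csv_envelope(url, option, csv_text):
    """Build the ``.csv`` result structure from already-fetched CSV text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text))]
    return {"url": url, "option": option, "data": rows}


def to_jsonp(callback, payload):
    """Wrap a JSON payload in a call to the client's callback function."""
    # Assumption: only accept a plain identifier as the callback name so
    # that the parameter cannot be used to inject arbitrary script.
    if not callback.isidentifier():
        raise ValueError("invalid callback name: %r" % callback)
    return "%s(%s);" % (callback, json.dumps(payload))


envelope = csv_envelope(
    "http://example.org/files/schools.csv",  # hypothetical resource URL
    "row=1&row=2",
    "name,pupils\nHilltop,120\n",
)
print(to_jsonp("jsonpcallback", envelope))
```

The browser loads the proxy URL in a ``script`` tag, so the response is executed as JavaScript and the named callback receives the data.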


Hurdles
-------

* Some data sets are not in text-based formats => Don't handle them at this stage
* Excel spreadsheets have formatting and different types => Ignore it, turn everything into a string for now
* Some data sets are huge => Don't proxy more than 100K of data; it is up to the user to filter it down if needed
* We don't want to re-download data sets => Need a way to cache data -> storage API
* Some applications might be wildly popular and put strain on the system => Perhaps API keys and rate limiting are needed so that individual apps/feeds can be disabled. How can we have read API keys on data.gov.uk?
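
The size limit and the "everything is a string" rule might be sketched as follows in Python 3. This is an illustration only: ``MAX_BYTES`` reads "100K" as 100 KiB, which is an assumption, and ``TooLargeError``, ``read_capped`` and ``stringify_rows`` are hypothetical names.

```python
import io

MAX_BYTES = 100 * 1024  # assumption: "100K" read as 100 KiB


class TooLargeError(Exception):
    """Raised when a resource exceeds the proxy's size limit."""


def read_capped(fp, limit=MAX_BYTES):
    """Read a file-like object, refusing anything over ``limit`` bytes.

    At most ``limit + 1`` bytes are requested, so an over-sized resource
    is detected without downloading all of it.
    """
    chunks = []
    remaining = limit + 1
    while remaining > 0:
        chunk = fp.read(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    raw = b"".join(chunks)
    if len(raw) > limit:
        raise TooLargeError("resource is larger than %d bytes" % limit)
    return raw


def stringify_rows(rows):
    """Ignore cell types: turn every value into a string (None -> '')."""
    return [["" if cell is None else str(cell) for cell in row] for row in rows]


# io.BytesIO stands in for the object returned by urlopen().
print(len(read_capped(io.BytesIO(b"a,b\n1,2\n"))))
print(stringify_rows([[1.0, "x", None], [42, True, 3.5]]))
```

Caching and rate limiting are left out of the sketch because they depend on the storage API and proxy setup that are still to be investigated.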


Next Steps
----------

* Investigate how this could be integrated with the proposed storage API
* Investigate rate limiting in Squid


Links for me to investigate
---------------------------

* http://www.faqs.org/docs/Linux-HOWTO/Bandwidth-Limiting-HOWTO.html
* http://www.scribd.com/doc/8622975/Use-Squid-to-reduce-bandwidth

Appendix 1
----------

The start of a proof of concept:

::

    ubuntu@ckan-dev:~/env/src/ckan$ hg diff
    diff -r 79260056ec71 ckan/config/routing.py
    --- a/ckan/config/routing.py        Mon Oct 04 14:21:50 2010 +0000
    +++ b/ckan/config/routing.py        Mon Oct 11 08:45:16 2010 +0000
    @@ -210,6 +210,8 @@
         map.connect('/api/2/qos/throughput/',
             controller='rest', action='throughput',
             conditions=dict(method=['GET']))
    +    # James's experimental proxy code.
    +    map.connect('/api/2/data', controller='data', action='index')
     
         map.redirect("/packages", "/package")
         map.redirect("/packages/{url:.*}", "/package/{url}")
    diff -r 79260056ec71 ckan/controllers/data.py
    --- /dev/null       Thu Jan 01 00:00:00 1970 +0000
    +++ b/ckan/controllers/data.py      Mon Oct 11 08:45:16 2010 +0000
    @@ -0,0 +1,66 @@
    +import logging
    +import urllib2
    +import xlrd
    +import csv
    +import ckan.authz
    +import ckan.model as model
    +import ckan
    +from StringIO import StringIO
    +from ckan.lib.base import BaseController, render, abort
    +from ckan import model
    +from ckan.model import meta
    +from sqlalchemy.sql import select, and_
    +from ckan.lib.base import _, request, response
    +from ckan.lib.cache import ckan_cache
    +from ckan.lib.helpers import json
    +from ckan.controllers.apiv1.package import PackageController as _PackageV1Controller
    +from ckan.controllers.apiv2.package import Rest2Controller
    +
    +log = logging.getLogger(__name__)
    +
    +class DataController(Rest2Controller, _PackageV1Controller):
    +    def _last_modified(self, id):
    +        """
    +        Return most recent timestamp for this package
    +        """
    +        return model.Package.last_modified(model.package_table.c.id == id)
    +
    +    #@ckan_cache(test=_last_modified, query_args=True)
    +    def index(self):
    +        """
    +        Return the specified package
    +        """
    +        url = request.params['url']
    +        fp = urllib2.urlopen(url)
    +        raw = fp.read()
    +        fp.close()
    +        book = xlrd.open_workbook('file', file_contents=raw, verbosity=0)
    +        names = []
    +        sheets = []
    +        for sheet_name in book.sheet_names():
    +            names.append(sheet_name)
    +            sheet_ = book.sheet_by_name(sheet_name)
    +            rows = []
    +            for rownum in range(sheet_.nrows):
    +                vals = sheet_.row_values(rownum)
    +                rows.append(vals)
    +            sheets.append(rows)
    +        csvs = []
    +        for i, sheet in enumerate(sheets):
    +            csvs.append(dict(name=names[i], content=sheet))
    +        return self._finish_ok(json.dumps(csvs))
    +