Ticket #698: data-api-jsonp-proxy.txt

File data-api-jsonp-proxy.txt, 6.5 KB (added by rgrp, 4 years ago)

Proposal for JSON-P Proxy
=========================

Motivation
----------

At the moment there is no way for a client-side developer to build an
application which accesses the data on data.gov.uk without first setting up and
installing a server to fetch the data and send it to the browser.

.. note::

   The reason for this is that the same-origin policy prevents a script
   running in the browser from reading data from other domains unless it is
   served as JSON-P. Since the data associated with a package can be hosted
   anywhere, a browser-based application cannot fetch it directly.

My thinking is that:

* A JSON-P feed of the data would lower the barrier to entry for app builders:

  * Every web developer knows how to build services with JavaScript, regardless of their main language
  * Even non-developers should be able to hack together charts etc.

* The more people that try to consume the data, the more feedback data publishers will get about how to publish their data in a useful way
* By concentrating on the browser as a platform, more useful opportunities may emerge (e.g. package previews, SQLite distribution format etc.)


Proposal
--------

The proposal is not to implement everything in one go but to start with a common format and look at a way of building a useful feed. I propose starting with spreadsheet-like data for two reasons:

* This is most easily understood by the type of developer who would build a browser-side mashup or plot a graph
* There are already solutions for handling XML, RDF data etc., not least the SPARQL feeds

We therefore want an API for proxying spreadsheet data via JSON-P.

There will be a new API at ``/api/spreadsheet?callback=jsonpcallback&url=``

The URL to fetch will be URL-encoded and passed in the ``url`` parameter. On
the server the URL will be decoded and checked to confirm that it really is a
URL linked to from a package resource and that it points to an ``.xls`` or
``.csv`` file.

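The server-side check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: ``is_proxyable`` and ``known_resource_urls`` are hypothetical names, and a real implementation would presumably look resource URLs up in the package database rather than in an in-memory set.

```python
from posixpath import splitext
from urllib.parse import unquote, urlparse

# Assumption: only these two extensions are proxied, as in the proposal.
ALLOWED_EXTENSIONS = {".xls", ".csv"}


def is_proxyable(encoded_url, known_resource_urls):
    """Return True if the URL-encoded ``url`` parameter may be proxied.

    ``known_resource_urls`` is a hypothetical stand-in for a lookup of the
    URLs that package resources link to.
    """
    url = unquote(encoded_url)
    parsed = urlparse(url)
    # It must really be an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # It must point at a spreadsheet-like file.
    extension = splitext(parsed.path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False
    # It must be linked to from a package resource.
    return url in known_resource_urls
```

Checking against the known resource URLs, and not just the file extension, is what stops the endpoint from becoming an open proxy for arbitrary files.
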
The API will allow the following:

* Downloading the entire spreadsheet
* Downloading a single sheet (add ``sheet=1`` to the URL)
* Downloading a range in a single sheet (add ``range=A1:K3`` to the URL) [a bit nasty for CSV files but will do I think]
* Choosing a limited set of rows within the sheet (add ``row=5&row=7&row_range=10:100000:5000``; the ``row_range`` format here means "give me every 5000th row between rows 10 and 100000")

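A minimal sketch of how the ``row``, ``row_range`` and ``range`` options might be applied to a sheet held as a list of rows. The function names are hypothetical, and two details the proposal leaves open are assumed here: row numbers are 1-based (as in a spreadsheet) and the ``stop`` value of ``row_range`` is inclusive.

```python
import re


def select_rows(rows, row=(), row_range=None):
    """Pick rows by number; numbers are assumed 1-based, stop inclusive."""
    wanted = set(row)
    if row_range:
        start, stop, step = (int(part) for part in row_range.split(":"))
        # Never iterate past the end of the sheet, whatever ``stop`` says.
        wanted.update(range(start, min(stop, len(rows)) + 1, step))
    return [rows[n - 1] for n in sorted(wanted) if 1 <= n <= len(rows)]


def column_index(letters):
    """Convert a column label such as ``A`` or ``AB`` to a 0-based index."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def select_range(rows, cell_range):
    """Cut a rectangle such as ``A1:K3`` out of a list of rows."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9]\d*):([A-Za-z]+)([1-9]\d*)", cell_range)
    if match is None:
        raise ValueError("bad range: %r" % cell_range)
    first_col, first_row = column_index(match.group(1)), int(match.group(2))
    last_col, last_row = column_index(match.group(3)), int(match.group(4))
    return [r[first_col:last_col + 1] for r in rows[first_row - 1:last_row]]
```

With these assumptions, ``row=5&row=7&row_range=10:100000:5000`` on a 20-row sheet yields rows 5, 7 and 10. Treating a CSV file as a single sheet lets ``select_range`` work for it in the same way.
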
The result will look like this, with only the appropriate parts populated. For ``.xls`` files:

::

    {
        "url": "http://...file.xls",
        "option": "row=5&row=7&row_range=10:100000:5000",
        "name": ["Sheet 1", "Sheet 2"],
        "sheet": {
            "Sheet 1": [
                [...],
                [...],
                [...]
            ]
        }
    }

For ``.csv`` files:

::

    {
        "url": "http://...file.csv",
        "option": "row=5&row=7&row_range=10:100000:5000",
        "data": [
            [...],
            [...],
            [...]
        ]
    }

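To make the JSON-P part concrete, here is a hedged sketch of how the ``.csv`` result structure could be built and wrapped in the callback named by the ``callback`` parameter. ``build_csv_result`` and ``to_jsonp`` are hypothetical helpers, and restricting callback names to dotted JavaScript identifiers is an added assumption, not something the proposal specifies.

```python
import json
import re

# Assumption: callback names are restricted to dotted JavaScript identifiers
# so that the parameter cannot be used to inject arbitrary script.
CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*")


def build_csv_result(url, option, rows):
    """Build the ``.csv`` result structure described in the proposal."""
    return {"url": url, "option": option, "data": rows}


def to_jsonp(callback, payload):
    """Wrap a JSON-serialisable payload in a JSON-P callback invocation."""
    if not CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name: %r" % callback)
    return "%s(%s);" % (callback, json.dumps(payload))
```

The browser loads the response through a ``<script>`` tag, so the body is a JavaScript call to the named function with the JSON document as its argument.
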

Hurdles
-------

* Some data sets are not in text-based formats => don't handle them at this stage
* Excel spreadsheets have formatting and different types => ignore them and turn everything into a string for now
* Some data sets are huge => don't proxy more than 100K of data; it is up to the user to filter it down if needed
* We don't want to re-download data sets => need a way to cache data (see the proposed storage API)
* Some applications might be wildly popular and put strain on the system => perhaps API keys and rate limiting are needed so that individual apps/feeds can be disabled. How can we have read API keys on data.gov.uk?

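Two of the workarounds above (turn everything into a string, and refuse results over 100K) are simple enough to sketch. The helper names are hypothetical, and "100K" is interpreted here as 100 * 1024 bytes of serialised JSON, which is an assumption.

```python
import json

# Assumption: "100K" is read here as 100 * 1024 bytes of serialised JSON.
MAX_BYTES = 100 * 1024


def stringify_rows(rows):
    """Ignore cell types and formatting: turn every cell into a string."""
    return [["" if cell is None else str(cell) for cell in row] for row in rows]


def check_size(payload, max_bytes=MAX_BYTES):
    """Serialise ``payload`` and refuse it if it exceeds ``max_bytes``."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > max_bytes:
        raise ValueError(
            "result is %d bytes; limit is %d, please filter with "
            "sheet, range, row or row_range" % (len(body), max_bytes)
        )
    return body
```

Measuring the serialised result, not the source file, means a large spreadsheet can still be served as long as the caller filters it down far enough.
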

Next Steps
----------

* Investigate how this could be integrated with the proposed storage API
* Investigate rate limiting in Squid


Links for me to investigate
---------------------------

* http://www.faqs.org/docs/Linux-HOWTO/Bandwidth-Limiting-HOWTO.html
* http://www.scribd.com/doc/8622975/Use-Squid-to-reduce-bandwidth

Appendix 1
----------

The start of a proof of concept:

::

    ubuntu@ckan-dev:~/env/src/ckan$ hg diff
    diff -r 79260056ec71 ckan/config/routing.py
    --- a/ckan/config/routing.py Mon Oct 04 14:21:50 2010 +0000
    +++ b/ckan/config/routing.py Mon Oct 11 08:45:16 2010 +0000
    @@ -210,6 +210,8 @@
         map.connect('/api/2/qos/throughput/',
             controller='rest', action='throughput',
             conditions=dict(method=['GET']))
    +    # James's experimental proxy code.
    +    map.connect('/api/2/data', controller='data', action='index')

         map.redirect("/packages", "/package")
         map.redirect("/packages/{url:.*}", "/package/{url}")
    diff -r 79260056ec71 ckan/controllers/data.py
    --- /dev/null Thu Jan 01 00:00:00 1970 +0000
    +++ b/ckan/controllers/data.py Mon Oct 11 08:45:16 2010 +0000
    @@ -0,0 +1,66 @@
    +import logging
    +import urllib2
    +import xlrd
    +import csv
    +import ckan.authz
    +import ckan.model as model
    +import ckan
    +from StringIO import StringIO
    +from ckan.lib.base import BaseController, render, abort
    +from ckan import model
    +from ckan.model import meta
    +from sqlalchemy.sql import select, and_
    +from ckan.lib.base import _, request, response
    +from ckan.lib.cache import ckan_cache
    +from ckan.lib.helpers import json
    +from ckan.controllers.apiv1.package import PackageController as _PackageV1Controller
    +from ckan.controllers.apiv2.package import Rest2Controller
    +
    +log = logging.getLogger(__name__)
    +
    +class DataController(Rest2Controller, _PackageV1Controller):
    +    def _last_modified(self, id):
    +        """
    +        Return most recent timestamp for this package
    +        """
    +        return model.Package.last_modified(model.package_table.c.id == id)
    +
    +    #@ckan_cache(test=_last_modified, query_args=True)
    +    def index(self):
    +        """
    +        Return the specified package
    +        """
    +        url = request.params['url']
    +        fp = urllib2.urlopen(url)
    +        raw = fp.read()
    +        fp.close()
    +        book = xlrd.open_workbook('file', file_contents=raw, verbosity=0)
    +        names = []
    +        sheets = []
    +        for sheet_name in book.sheet_names():
    +            names.append(sheet_name)
    +            sheet_ = book.sheet_by_name(sheet_name)
    +            rows = []
    +            for rownum in range(sheet_.nrows):
    +                vals = sheet_.row_values(rownum)
    +                rows.append(vals)
    +            sheets.append(rows)
    +        csvs = []
    +        for i, sheet in enumerate(sheets):
    +            csvs.append(dict(name=names[i], content=sheet))
    +        return self._finish_ok(json.dumps(csvs))
    +
