Ticket #698 (closed task: fixed)
CKAN Data API v1
| Reported by: | thejimmyg | Owned by: | Stiivi |
|---|---|---|---|
| Priority: | critical | Milestone: | ckan-v1.3-sprint-1 |
| Component: | ckan | Keywords: | |
| Cc: | | Repository: | |
| Theme: | | | |
Description
This proposal is to discuss adding a new API for proxying certain spreadsheet data via JSON-P to make it possible to build simple browser apps directly off the API.
See the attached proposal for information.
Attachments
Change History
comment:1 Changed 3 years ago by Stiivi
- Priority set to awaiting triage
- Component set to ckan
I see two possible options:
Option A: store only mirrors of source files, have file format based plugins for querying files
Option B: store mirrors of source files, have plugin based loading scripts into "common structured format", have single query module.
I would go with option B as it is:
- easier to implement - file format based transformations are simpler than file format based queries
- more transparent data management process
- only one simple query module
(see attached ckan-srcmirror.png)
Option B will also fit better into the broader data architecture context:
http://democracyfarm.org/f/ckan/data_arch.png
Concerning the API, I would suggest trying to be compatible with the Google Spreadsheets API:
http://code.google.com/apis/spreadsheets/data/3.0/reference.html
Changed 3 years ago by Stiivi
- Attachment ckan-srcmirror.png added
CKAN Source Mirror and transformations for Data API
comment:2 Changed 3 years ago by thejimmyg
Actually we've implemented a first version which doesn't store the data.
See this post: http://blog.ckan.org/2010/12/04/open-data-day-announcing-ckan-data-proxy/
You can get data like this:
comment:3 Changed 3 years ago by Stiivi
@thejimmyg: It is a neat, simple solution.
You have suggested a proxy API:
There will be a new API at /api/spreadsheet?callback=jsonpcallback&url=
There are two options:
- Have a public CKAN data proxy as a stand-alone service: I get the package resource URL from CKAN and pass it to the proxy
- Have a CKAN data API (as the ticket title suggests): if I am talking to CKAN, I am getting data from CKAN. I should not care about the proxy or anything behind it, nor should I care about the original data source. I care about resource data in a format that I can process (CSV/JSON).
For CKAN data API I would suggest something like:
/api/resource_data/RESOURCE_ID?...
or more human readable:
/api/resource_data/PACKAGE_NAME/RESOURCE_NUMBER?...
This restricts the API to CKAN resources. Moreover, serving only resource data (not data from arbitrary URLs) would allow us to pre-process resources in the future.
First version/implementation: pass each requested resource URL to your proxy service (external, not CKAN related), which determines the file type from the file extension in the URL and fails on an unknown or unprocessable file.
/api/resource_data/PACKAGE/RESOURCE?output=jsonp&sheet=1...
would be redirected to (for example):
http://1.latest.jsonpdataproxy.appspot.com/?url=RESOURCE["URL"]&sheet=1...
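A minimal sketch of how the redirect target could be built. The base URL is taken from the example above; the function name `proxy_redirect_url` and the rule that a client-supplied `url` parameter is ignored are assumptions for illustration, not part of the proposal.

```python
from urllib.parse import urlencode

# Base URL taken from the example above; a real deployment may use another host.
PROXY_BASE = "http://1.latest.jsonpdataproxy.appspot.com/"


def proxy_redirect_url(resource_url, params):
    """Build the proxy URL for a CKAN resource (hypothetical helper).

    resource_url: the URL stored in the CKAN resource record.
    params: the query parameters of the incoming /api/resource_data request.
    """
    query = {"url": resource_url}
    # Pass the remaining parameters (sheet, output, ...) through unchanged.
    # Assumption: a client-supplied "url" is dropped, so only CKAN resources
    # can be fetched through this API.
    query.update({k: v for k, v in params.items() if k != "url"})
    return PROXY_BASE + "?" + urlencode(query)
```

For example, `proxy_redirect_url("http://foo.com/data.xls", {"sheet": "1"})` yields a proxy URL whose `url` parameter is the percent-encoded resource URL, followed by `sheet=1`.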
Second version/implementation: determine the file type in advance and pass the resource to the appropriate conversion service when requested.
If you upload a document to Scribd or SlideShare, it gets processed in the background. The same can be done in CKAN after any resource change. We do not need to download the file at the moment; however, what can be done is:
- try a converter by URL file extension
- try a converter by MIME type (content-type header)
- brute-force try all converters
No need to store copies of files; just store the determined file type somewhere in the resource record (as a MIME type).
Also, it would be nice if every data conversion service provided output in both JSON and CSV. We would then be able to have a "Download CSV" link directly on the CKAN web page for browsing users:
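The three detection steps above could be sketched roughly as follows. The lookup tables, the function name `detect_type` and the shape of the `converters` argument are all hypothetical; the ticket does not define them.

```python
import os
from urllib.parse import urlparse

# Hypothetical lookup tables; the actual mappings are not defined in this ticket.
EXTENSION_TYPES = {".csv": "csv", ".xls": "xls"}
MIME_TYPES = {"text/csv": "csv", "application/vnd.ms-excel": "xls"}


def detect_type(url, content_type=None, converters=None, sample=None):
    """Return a type name for a resource, or None if nothing matches.

    converters: optional dict mapping a type name to a callable that raises
    an exception when it cannot process `sample` (assumed interface).
    """
    # 1. Try the file extension in the URL path (query string is ignored).
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext in EXTENSION_TYPES:
        return EXTENSION_TYPES[ext]

    # 2. Try the Content-Type header, ignoring parameters such as charset.
    if content_type:
        mime = content_type.split(";")[0].strip().lower()
        if mime in MIME_TYPES:
            return MIME_TYPES[mime]

    # 3. Brute force: the first converter that does not raise wins.
    for name, convert in (converters or {}).items():
        try:
            convert(sample)
        except Exception:
            continue
        return name
    return None
```

The returned name is what would be stored in the resource record after a resource change, so the check does not have to be repeated on every request.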
/api/resource_data/PACKAGE/RESOURCE?output=csv...
comment:4 Changed 3 years ago by Stiivi
I have created a "proof of concept" implementation that uses the external data proxy service when accessing:
/api/data/PACKAGE_ID
like:
http://127.0.0.1:5000/api/data/069c80f8-8476-452e-bfd4-0a9077666c14
It just works, but requires refactoring to match CKAN standards. I would need help from someone who knows CKAN internals better.
comment:5 Changed 3 years ago by Stiivi
One more note: it would be good if packages had names/identifiers as well, as referencing internal IDs from the outside world is not very good practice: they are quite volatile, mostly with regard to expected objects.
PACKAGE/RESOURCE_REFERENCE
Possible resource references:
- 'default' - reserved keyword for 'the only one resource' if there is only one, or first resource if there are more or the one with flag 'default'
- 'latest' - to be able to access 'latest' resource within package (or 'actual' or 'last'?)
- alphanumeric identifier (not starting with number)
- number - index of the resource as a human/visitor sees it on the page (not the same as the "position" attribute, which might contain gaps or be different, and in some cases is). The index of a resource should be something like:
SELECT package_id, id, url, ROW_NUMBER() OVER (PARTITION BY package_id ORDER BY position) AS index FROM package_resource
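A rough sketch of how the reference kinds listed above might be resolved in code. The dictionary keys (`position`, `name`, `default`) and the reading of 'latest' as "highest position" are assumptions made for illustration only.

```python
def resolve_resource(resources, reference):
    """Resolve a RESOURCE_REFERENCE within one package (hypothetical helper).

    resources: list of dicts with a "position" key and optional "name" and
    "default" keys (assumed record shape, not CKAN's actual model).
    """
    # Order by position; gaps in "position" do not matter for the index.
    ordered = sorted(resources, key=lambda r: r["position"])
    if not ordered:
        return None

    if reference == "default":
        # Prefer a resource flagged as default, otherwise the first one.
        flagged = [r for r in ordered if r.get("default")]
        return flagged[0] if flagged else ordered[0]

    if reference == "latest":
        # Assumption: "latest" means the resource with the highest position.
        return ordered[-1]

    if reference.isdecimal():
        # 1-based index, matching ROW_NUMBER() in the query above.
        index = int(reference)
        return ordered[index - 1] if 1 <= index <= len(ordered) else None

    # Otherwise treat the reference as an alphanumeric identifier.
    for resource in ordered:
        if resource.get("name") == reference:
            return resource
    return None
```

Because identifiers may not start with a number, a purely numeric reference can be treated as an index without ambiguity.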
comment:6 Changed 3 years ago by Stiivi
"draft": https://github.com/Stiivi/ckanext-dataapi
It requires that the client handles HTTP 302 redirects correctly.
comment:7 Changed 3 years ago by rgrp
- Owner changed from rgrp to Stiivi
- Priority changed from awaiting triage to critical
- Milestone set to ckan-v1.3-sprint-1
- move repo to bitbucket
- clone james proxy code and modify to make google spreadsheets compatible (add a test ...)
- update the ckanext to pass on parameters ....
- Deploy all of this to test.ckan.net
- Rufus: check redirects with javascript
comment:8 Changed 3 years ago by Stiivi
Here is the fork for (json) data proxy:
https://bitbucket.org/Stiivi/dataproxy
I've refactored it and moved transformations into separate modules. For each resource type there should be a module in transform/<type>_transform.py
Each module should implement transform(flow, url, query) and should return a dictionary as a result.
Existing modules:
- transform/csv_transform - CSV files
- transform/xls_transform - Excel XLS files
If there is no module for the resource type, an HTTP 200 response with the error "Resource type not supported" is returned.
You can override the URL file extension, or specify the type if the extension is missing, through the type= URL option. For example, if a URL serves CSV data but is just foo.com/data, you can pass: url=http://foo.com/data&type=csv
Note: the source is refactored/updated in example/dataproxy and is being tested by running it locally at localhost:8000.
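To illustrate the dispatch described above, here is a simplified sketch. It uses an in-memory registry in place of the real transform/<type>_transform.py modules, and the stand-in `csv_transform`, the helper names and the shape of the error dictionary are assumptions; only the `transform(flow, url, query)` signature and the type= override come from this comment.

```python
import os
from urllib.parse import urlparse


def csv_transform(flow, url, query):
    # Stand-in for transform/csv_transform.transform; real parsing is omitted.
    return {"url": url, "type": "csv"}


# Hypothetical registry standing in for the transform/<type>_transform.py modules.
TRANSFORMS = {"csv": csv_transform}


def resource_type(url, query):
    """Pick the resource type: an explicit type= option wins over the extension."""
    if query.get("type"):
        return query["type"].lower()
    ext = os.path.splitext(urlparse(url).path)[1]
    return ext.lstrip(".").lower()


def dispatch(flow, url, query):
    """Call the matching transform, or return an error dictionary."""
    rtype = resource_type(url, query)
    transform = TRANSFORMS.get(rtype)
    if transform is None:
        # Assumed error shape; the proxy replies with an error rather than raising.
        return {"error": "Resource type not supported", "type": rtype}
    return transform(flow, url, query)
```

With this sketch, `dispatch(None, "http://foo.com/data", {"type": "csv"})` reaches the CSV transform even though the URL has no extension.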
comment:9 Changed 3 years ago by Stiivi
Pushed parameter passing. Changed handling of an unknown reply type on the proxy side: do not raise an exception, but reply with 200 Error - unknown reply type, use json/jsonp
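A small sketch of what this reply handling could look like. The function name, parameter names and the exact error payload are assumptions; the actual proxy code may differ.

```python
import json


def render_reply(data, reply_format="json", callback="callback"):
    """Serialise a result as JSON or JSON-P (hypothetical helper).

    An unknown reply format produces an error payload instead of raising.
    """
    if reply_format == "json":
        return json.dumps(data)
    if reply_format == "jsonp":
        # JSON-P wraps the JSON document in a call to the client's callback.
        return "%s(%s);" % (callback, json.dumps(data))
    # Assumed error shape, returned with a normal 200 response.
    return json.dumps({"error": "unknown reply type, use json/jsonp"})
```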
comment:10 Changed 3 years ago by anonymous
Data proxy documentation: http://democracyfarm.org/dataproxy/api.html (included in sources)
Updated ('s' as in structured) data proxy app: http://sdataproxy.appspot.com
comment:11 Changed 3 years ago by rgrp
- Status changed from new to closed
- Resolution set to fixed
This ticket is complete:
- ckanext-dataapi: working /api/data/{resource-id} with tests
- https://bitbucket.org/okfn/dataproxy - the dataproxy code running at http://jsonpdataproxy.appspot.com
- functioning but needs tests and improvements
There are a whole bunch of improvements to be done, but these will be in ticket:888