Ticket #698 (closed task: fixed)

Opened 4 years ago

Last modified 3 years ago

CKAN Data API v1

Reported by: thejimmyg Owned by: Stiivi
Priority: critical Milestone: ckan-v1.3-sprint-1
Component: ckan Keywords:
Cc: Repository:
Theme:

Description

This proposal is to discuss adding a new API for proxying certain spreadsheet data via JSON-P to make it possible to build simple browser apps directly off the API.

See the attached proposal for information.

Attachments

data-api-jsonp-proxy.txt (6.5 KB) - added by rgrp 4 years ago.
ckan-srcmirror.png (114.3 KB) - added by Stiivi 3 years ago.
CKAN Source Mirror and transformations for Data API
ckan-dataapi.png (229.4 KB) - added by Stiivi 3 years ago.
Data API through remote data proxy

Change History

Changed 4 years ago by rgrp

comment:1 Changed 3 years ago by Stiivi

  • Priority set to awaiting triage
  • Component set to ckan

I see two possible options:

Option A: store only mirrors of source files, have file format based plugins for querying files

Option B: store mirrors of source files, have plugin-based scripts that load them into a "common structured format", and have a single query module.

I would go with option B as it is:

  • easier to implement - file format based transformations are simpler than file format based queries
  • more transparent data management process
  • only one simple query module

(see attached ckan-srcmirror.png)

Option B will fit better into the broader data architecture context:

http://democracyfarm.org/f/ckan/data_arch.png

Concerning the API, I would suggest trying to be compatible with the Google Spreadsheets API:

http://code.google.com/apis/spreadsheets/data/3.0/reference.html

Changed 3 years ago by Stiivi

CKAN Source Mirror and transformations for Data API

comment:2 Changed 3 years ago by thejimmyg

comment:3 Changed 3 years ago by Stiivi

@thejimmyg: It is a neat, simple solution.

You have suggested a proxy API:

There will be a new API at /api/spreadsheet?callback=jsonpcallback&url=

There are two options:

  1. Have a public CKAN data proxy as a stand-alone service: I get the package resource URL from CKAN and pass it to the proxy.
  2. Have a CKAN data API (as the ticket title suggests): if I am talking to CKAN, I am getting data from CKAN. I should not care about the proxy or anything behind it, nor about the original data source. I care about the resource data in a format that I can process (CSV/JSON).

For CKAN data API I would suggest something like:

/api/resource_data/RESOURCE_ID?...

or more human readable:

/api/resource_data/PACKAGE_NAME/RESOURCE_NUMBER?...

This will allow others to get only CKAN resources. Moreover, serving only resource data (rather than data from any URL) would allow us to pre-process resources in the future.

First version/implementation: pass each requested resource URL to your proxy service (external, not CKAN related), which determines the file type from the file extension in the URL and fails on an unknown or unprocessable file.

/api/resource_data/PACKAGE/RESOURCE?output=jsonp&sheet=1...

would be redirected to (for example):

http://1.latest.jsonpdataproxy.appspot.com/?url=RESOURCE["URL"]&sheet=1...
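The redirect described above could be built roughly as follows. This is only a sketch: the function name build_proxy_url is made up for illustration, and dropping the output option before redirecting is an assumption based on the example URLs in this comment, not a confirmed behaviour of the proxy.

```python
from urllib.parse import urlencode

# Proxy endpoint taken from the example URL in this comment.
PROXY_BASE = "http://1.latest.jsonpdataproxy.appspot.com/"


def build_proxy_url(resource_url, query):
    """Return the proxy URL for a resource, passing query options through.

    `query` holds the options from the /api/resource_data request, such as
    sheet=1. Assumption: `output` is only meaningful to the CKAN-side API,
    so it is not forwarded (the example redirect above omits it).
    """
    params = {"url": resource_url}
    for key, value in query.items():
        if key != "output":
            params[key] = value
    return PROXY_BASE + "?" + urlencode(params)
```

The CKAN side would then answer the /api/resource_data request with an HTTP redirect to the returned URL.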

Second version/implementation: determine the file type in advance and pass the resource to the appropriate conversion service when requested.

If you upload a document to Scribd or SlideShare, it gets processed in the background. The same can be done in CKAN after any resource change. We do not need to download the file at the moment; however, what can be done is:

  1. try a converter by URL file extension
  2. try a converter by MIME type (content-type header)
  3. brute-force try all converters

No need to store copies of files; just store the determined file type somewhere in the resource record (as a MIME type).
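The three-step detection order could look something like the sketch below. The lookup tables, the function name detect_type and the shape of the converters argument are all hypothetical; the ticket does not specify which types or converters exist.

```python
import os
from urllib.parse import urlparse

# Hypothetical lookup tables; the ticket does not list supported types.
EXTENSION_TYPES = {".csv": "csv", ".xls": "xls"}
MIME_TYPES = {"text/csv": "csv", "application/vnd.ms-excel": "xls"}


def detect_type(url, content_type=None, converters=None, sample=None):
    """Return a type name using the three-step order, or None if all fail.

    `converters` maps a type name to a callable that returns True when it
    can parse `sample`; it is only used for the brute-force step.
    """
    # 1. Try the URL file extension.
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext in EXTENSION_TYPES:
        return EXTENSION_TYPES[ext]

    # 2. Try the MIME type from the Content-Type header (ignore parameters).
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in MIME_TYPES:
            return MIME_TYPES[mime]

    # 3. Brute force: try every converter until one accepts the sample.
    for name, can_parse in (converters or {}).items():
        try:
            if can_parse(sample):
                return name
        except Exception:
            continue
    return None
```

The returned type name is what would be stored in the resource record, so the detection only has to run after a resource changes.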

Also, it would be nice if any data conversion service provided output in both JSON and CSV. We would then be able to have a "Download CSV" link directly on the CKAN web page for browsing users:

/api/resource_data/PACKAGE/RESOURCE?output=csv...

comment:4 Changed 3 years ago by Stiivi

I have created a "proof of concept" implementation that uses the external data proxy service when accessing:

/api/data/PACKAGE_ID

like:

http://127.0.0.1:5000/api/data/069c80f8-8476-452e-bfd4-0a9077666c14

It just works, but requires refactoring to match CKAN standards. I would need help from someone who knows CKAN internals better.

comment:5 Changed 3 years ago by Stiivi

One more note: it would be good if packages had names/identifiers as well, as referencing internal IDs from the outside world is not very good practice - they are quite volatile, mostly in regard to expected objects.

PACKAGE/RESOURCE_REFERENCE

Possible resource references:

  • 'default' - reserved keyword for the only resource if there is just one; otherwise the first resource, or the one flagged 'default'
  • 'latest' - to be able to access 'latest' resource within package (or 'actual' or 'last'?)
  • alphanumeric identifier (not starting with number)
  • number - index of the resource as a human visitor sees it on the page. This is not the same as the "position" attribute, which might contain gaps or differ (and in some cases does). The index should be something like:
SELECT package_id, id, url, ROW_NUMBER() OVER (PARTITION BY package_id ORDER BY position) AS index FROM package_resource
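A rough sketch of how these references might be resolved for one package follows. The dict keys "name" and "default" are hypothetical field names, "latest" is assumed to mean the last resource in display order, and a 'default' flag is assumed to take precedence over the first resource; none of this is settled by the ticket.

```python
def resolve_resource(resources, reference):
    """Pick one resource dict from a package's resources by reference.

    Each resource is assumed to be a dict with a "position" key and
    optional "name" and "default" keys (hypothetical field names).
    """
    if not resources:
        raise LookupError("package has no resources")
    # Order as the visitor sees them; positions may contain gaps.
    ordered = sorted(resources, key=lambda r: r["position"])

    if reference == "default":
        flagged = [r for r in ordered if r.get("default")]
        return flagged[0] if flagged else ordered[0]
    if reference == "latest":
        # Assumption: "latest" means the last resource in display order.
        return ordered[-1]
    if reference.isdigit():
        index = int(reference)  # 1-based, like ROW_NUMBER() in the query
        if 1 <= index <= len(ordered):
            return ordered[index - 1]
        raise LookupError("no resource with index %s" % reference)
    # Alphanumeric identifier (not starting with a number).
    for r in ordered:
        if r.get("name") == reference:
            return r
    raise LookupError("no resource named %r" % reference)
```

Because the numeric index is computed from the sorted order rather than read from "position", gaps in the stored positions do not affect it.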

comment:6 Changed 3 years ago by Stiivi

"draft": https://github.com/Stiivi/ckanext-dataapi

It requires that the client handles HTTP 302 redirects correctly.

comment:7 Changed 3 years ago by rgrp

  • Owner changed from rgrp to Stiivi
  • Priority changed from awaiting triage to critical
  • Milestone set to ckan-v1.3-sprint-1
  1. move repo to bitbucket
  2. clone James's proxy code and modify it to make it Google Spreadsheets compatible (add a test ...)
  3. update the ckanext to pass on parameters ....
  4. Deploy all of this to test.ckan.net
  5. Rufus: check redirects with javascript

comment:8 Changed 3 years ago by Stiivi

Here is the fork for (json) data proxy:

https://bitbucket.org/Stiivi/dataproxy

I've refactored it and moved transformations into separate modules. For each resource type there should be a module in transform/<type>_transform.py

Each module should implement transform(flow, url, query) and should return a dictionary as a result.

Existing modules:

  • transform/csv_transform - CSV files
  • transform/xls_transform - Excel XLS files

If there is no module for the resource type, an HTTP 200 response with the error "Resource type not supported" is returned.

You can override the URL file extension, or specify the type if the extension is missing, through the type= URL option. For example, if a URL contains CSV data but is just foo.com/data, you can pass: url=http://foo.com/data&type=csv
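The type lookup and module dispatch described above might be sketched like this. The transformers registry and the shape of the error dictionary are assumptions made for illustration; the actual proxy locates modules by file name under transform/.

```python
import os
from urllib.parse import urlparse


def resolve_type(url, query):
    """Return the resource type: the type= option wins, else the URL extension."""
    explicit = query.get("type")
    if explicit:
        return explicit.lower()
    ext = os.path.splitext(urlparse(url).path)[1]
    return ext.lstrip(".").lower() or None


def dispatch(transformers, flow, url, query):
    """Call the matching transform(flow, url, query), or return an error dict.

    `transformers` is a hypothetical registry mapping a type name to a
    transform function with the signature described in this comment.
    """
    resource_type = resolve_type(url, query)
    transform = transformers.get(resource_type)
    if transform is None:
        # The error is reported in the reply body, sent with HTTP status 200.
        return {"error": "Resource type not supported", "type": resource_type}
    return transform(flow, url, query)
```

With this arrangement, url=http://foo.com/data&type=csv is routed to the CSV transformation even though the URL has no extension.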

Note: the source is refactored/updated in example/dataproxy and is being tested by running it locally at localhost:8000.

comment:9 Changed 3 years ago by Stiivi

Pushed parameter passing; changed the handling of an unknown reply type on the proxy side: do not raise an exception, but reply with 200 Error - unknown reply type, use json/jsonp
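As an illustration of that behaviour, a reply renderer could return an error body with status 200 instead of raising. The function name render_reply, the (status, body) return shape and the error dictionary are assumptions, not the proxy's actual code.

```python
import json


def render_reply(payload, reply_type="json", callback=None):
    """Return (status, body) for a transformed payload.

    An unknown reply type does not raise; it produces an error body that is
    still sent with status 200, as described in this comment.
    """
    if reply_type == "json":
        return 200, json.dumps(payload)
    if reply_type == "jsonp" and callback:
        # JSON-P wraps the JSON document in a call to the client's callback.
        return 200, "%s(%s);" % (callback, json.dumps(payload))
    return 200, json.dumps({"error": "unknown reply type, use json/jsonp"})
```

Keeping the status at 200 means a script-tag based JSON-P client still receives a body it can inspect for the error.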

Changed 3 years ago by Stiivi

Data API through remote data proxy

comment:10 Changed 3 years ago by anonymous

Data proxy documentation: http://democracyfarm.org/dataproxy/api.html (included in sources)

Updated ('s' as in structured) data proxy app: http://sdataproxy.appspot.com

comment:11 Changed 3 years ago by rgrp

  • Status changed from new to closed
  • Resolution set to fixed

This ticket is complete:

There are a whole bunch of improvements to be done, but these will be in ticket:888
