Ticket #888 (closed enhancement: fixed)

Opened 3 years ago

Last modified 3 years ago

Improvements to the dataproxy and the data API

Reported by: rgrp Owned by: johnglover
Priority: major Milestone: ckan-sprint-2011-10-28
Component: ckan Keywords:
Cc: Repository: ckan
Theme: none

Description

A first version of the dataproxy and data API is working (ticket:698), but we have identified a number of important improvements (these should be split into sub-tickets):

For dataproxy:

  • Testing for dataproxy
    • Can start by using known good remote urls (moving forward could switch to providing/mocking these locally)
  • Remove the content-length requirement for CSV: just read the first x rows (up to some configurable maximum)
  • Google docs style row/column selections
  • Use the swiss library - https://bitbucket.org/okfn/swiss
    • Support google docs spreadsheets (format = service/gdocs/ccc or gdocs/ccc or gdocs/spreadsheet)
  • Handle redirects for content-length?
  • Ignore the resource type if it is not recognized and fall back to identifying the format from the extension (or mime-type?)
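
Two of the items above (reading only the first x rows, and falling back to the file extension when the resource type is not recognized) could look roughly like the sketch below. This is not the dataproxy code: the names MAX_ROWS, KNOWN_FORMATS, read_first_rows and guess_format are all hypothetical, and the set of recognized formats is an assumption.

```python
import csv
import io
import itertools
import posixpath
from urllib.parse import urlparse

MAX_ROWS = 100                  # hypothetical configurable ceiling
KNOWN_FORMATS = {"csv", "xls"}  # assumed set of formats the proxy understands


def read_first_rows(stream, requested=None, max_rows=MAX_ROWS):
    """Read at most min(requested, max_rows) CSV rows from a text stream.

    Rows are pulled lazily, so the total size of the file never needs
    to be known in advance (no content-length required).
    """
    limit = max_rows if requested is None else min(requested, max_rows)
    limit = max(0, limit)  # guard against negative requests
    return list(itertools.islice(csv.reader(stream), limit))


def guess_format(resource_type, url):
    """Use the declared type if recognized, else fall back to the URL extension."""
    declared = (resource_type or "").strip().lower()
    if declared in KNOWN_FORMATS:
        return declared
    # Look only at the path part so query strings do not confuse the check.
    extension = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
    return extension if extension in KNOWN_FORMATS else None
```

For example, read_first_rows(io.StringIO(text), requested=500) would stop after MAX_ROWS rows however large the remote file is.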

For dataapi:

  • Ensure we pass on resource format as part of redirect i.e. /api/data/{id} -> {dataproxy}?url={resource-url}&type={resource-type}
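
A minimal sketch of building that redirect target, assuming the dataproxy takes `url` and `type` as query parameters exactly as written above. The function name and the example host are hypothetical.

```python
from urllib.parse import urlencode


def dataproxy_redirect_url(dataproxy_base, resource_url, resource_type):
    """Build the dataproxy URL that /api/data/{id} would redirect to.

    urlencode percent-escapes the resource URL so that its own query
    string cannot be confused with the dataproxy parameters.
    """
    query = urlencode({"url": resource_url, "type": resource_type})
    return f"{dataproxy_base}?{query}"
```

Escaping matters here because resource URLs often carry their own `?` and `&` characters, which would otherwise be read as extra dataproxy parameters.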

Change History

comment:1 Changed 3 years ago by Stiivi

Changes to Data Proxy:

  • tests added with configurable list of known URLs
  • use brewery for transformations (included reference to brewery framework in a new vendor directory)
  • side effect: code to make google find external packages in vendor directory (from now on, all external packages should go there and be referenced from .hgsub if they are mercurial repositories)
  • changed response contents: moved from 'headers' to root, renamed 'response' to 'data', added field list as 'fields'
  • changed way of registering transformers (now class object is used instead of name)
  • added 'encoding' and 'dialect' parameters for CSV
  • added optional data audit (parameter 'audit')
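
To illustrate how the 'encoding' and 'dialect' parameters and the new 'fields' / 'data' response keys might fit together, here is a sketch using Python's standard csv module rather than brewery. The function name is hypothetical, and treating the first row as the field list is an assumption, not something the changeset is known to do.

```python
import csv
import io


def parse_csv(raw_bytes, encoding="utf-8", dialect="excel"):
    """Decode CSV bytes and return a dict with 'fields' and 'data' keys."""
    text = raw_bytes.decode(encoding)
    # newline="" lets the csv module handle line endings itself.
    reader = csv.reader(io.StringIO(text, newline=""), dialect=dialect)
    rows = list(reader)
    # Assumption: the first row holds the field names.
    fields = rows[0] if rows else []
    return {"fields": fields, "data": rows[1:]}
```

A tab-separated Latin-1 file would then be parsed with parse_csv(raw, encoding="latin-1", dialect="excel-tab").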

Changes: https://bitbucket.org/Stiivi/dataproxy/changeset/fccbdd275be5

Data information: http://databrewery.org/doc/data_quality.html#brewery.dq.FieldStatistics

comment:2 Changed 3 years ago by thejimmyg

  • Owner changed from Stiivi to thejimmyg
  • Repository set to ckan
  • Theme set to none
  • Status changed from new to assigned

I don't think any progress has been made on this for a bit so I'm assigning it to me.

comment:3 Changed 3 years ago by shevski

  • Owner changed from thejimmyg to johnglover
  • Milestone changed from ckan-v1.5 to ckan-current-sprint

comment:4 Changed 3 years ago by johnglover

  • Status changed from assigned to closed
  • Resolution set to fixed

Dataproxy / Dataapi now deprecated in favour of the combination of new QA archive / process commands and the webstore.

Changes in relation to Dataproxy / Dataapi:

  • Currently only CSV files are supported; support for Excel and Google Docs spreadsheets is planned.
  • Uses David Raznick's CSV parser instead of Brewery for parsing, as it handles messy CSV data better.

Changes in relation to old QA functionality:

  • decoupled archiving (downloading) and QA process
  • added a new 'process' command which parses downloaded files and adds them to a local webstore

Closing for now; any improvements or feature requests should go in tickets relating to either the QA functionality or the webstore.
