
Ticket #235 (assigned enhancement)

Opened 4 years ago

Last modified 23 months ago

Resource format normalization and detection

Reported by:	dread	Owned by:	tobes
Priority:	awaiting triage	Milestone:	ckan-v1.9
Component:	ckan	Keywords:
Cc:		Repository:	ckan
Theme:	none

Description (last modified by pudo)

Try to gather proper MIME information for all package resources in CKAN. This is a shared ticket with dcat-tools (https://bitbucket.org/pudo/dcat-tools), i.e. opendatasearch.org. This can then also be used by ckanrdf, the CKAN RDF conversion service.

Sub-tasks:

- Create a Google Spreadsheet with two worksheets: "MIME-Mappings", e.g. "CSV" -> "text/csv", and "Name mappings", e.g. "text/csv" -> "Comma-Separated Spreadsheet".
- Collect and map surface forms from all CKANs.
- Access this via Swiss and apply it; store the result as a PackageResource extra field pending #826 (Resource extras).
- Add heuristics for format auto-detection:
  - Map well-known file extensions
  - Recognize obvious magic (Zip, Tar)
  - Peek into Zipfile/Tarfiles?
- Define a convention for generic data types (many CKAN packages have only "Spreadsheet" defined; either detect the specific type or set the MIME type to */tabular-data or similar).
See also: #816 (Autocomplete for the resource format field)
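
A minimal sketch of how the mapping and detection sub-tasks above could fit together, for discussion only. The table contents, the function names (normalize_format, detect_from_extension, detect_from_magic, guess_mime) and the order in which the heuristics are tried are all assumptions made for illustration; none of them come from CKAN, datautil or dcat-tools, and the real mappings would be loaded from the spreadsheet rather than hard-coded.

```python
import os
from urllib.parse import urlsplit

# Hypothetical stand-ins for the two worksheets described in the ticket.
MIME_MAPPINGS = {
    "csv": "text/csv",
    "xls": "application/vnd.ms-excel",
    "zip": "application/zip",
}
NAME_MAPPINGS = {
    "text/csv": "Comma-Separated Spreadsheet",
}

# Hypothetical table of well-known file extensions.
EXTENSION_MAPPINGS = {
    ".csv": "text/csv",
    ".xls": "application/vnd.ms-excel",
    ".zip": "application/zip",
    ".tar": "application/x-tar",
}


def normalize_format(surface_form):
    """Map a free-text format such as ' CSV ' to a MIME type, or None."""
    key = (surface_form or "").strip().lower().lstrip(".")
    return MIME_MAPPINGS.get(key)


def detect_from_extension(url):
    """Guess a MIME type from the extension of the URL path, or None."""
    path = urlsplit(url).path
    extension = os.path.splitext(path)[1].lower()
    return EXTENSION_MAPPINGS.get(extension)


def detect_from_magic(head):
    """Recognize Zip and Tar from the first bytes of a file, or None.

    Only the common cases are covered: a Zip local file header at
    offset 0 and the 'ustar' marker at offset 257 of a Tar header.
    """
    if head.startswith(b"PK\x03\x04"):
        return "application/zip"
    if head[257:262] == b"ustar":
        return "application/x-tar"
    return None


def guess_mime(declared_format, url, head=b""):
    """Try the declared format first, then the extension, then magic."""
    return (
        normalize_format(declared_format)
        or detect_from_extension(url)
        or detect_from_magic(head)
    )
```

With these assumed tables, a resource declared only as "Spreadsheet" falls through the format lookup and is resolved from its URL extension instead; a resource with no usable format or extension would need the first few hundred bytes of the file to be fetched for the magic check.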

Change History

comment:1 Changed 3 years ago by pudo

Description modified
Summary changed from Resource format detect from filename extension to Resource format normalization and detection

comment:2 Changed 3 years ago by pudo

Repository set to ckan
Theme set to none

The first three sub-items of this ticket are done in datautil and dcat-tools:

Basic GDocs-based normalizer:

https://bitbucket.org/okfn/datautil/src/8bba83b4fa45/datautil/normalization/table_based.py

Example of use:

https://bitbucket.org/okfn/dcat-tools/src/0ec5012bf12a/dcat/core/normalize.py#cl-32

Spreadsheet (as referenced in datautil source, should be a kwarg):

https://spreadsheets.google.com/ccc?key=0AplklDf0nYxWdE8tVlRrN1F3bG9PdDBFUDNZcENDNEE&hl=en#gid=0

Experience so far is that Google rate-limits the current implementation, so we should perform all operations in one or two large calls rather than piece by piece.
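
One way the "one or two big calls" idea could look, as a rough sketch rather than a description of the datautil code: fetch the whole worksheet once, build an in-memory lookup, and normalize every value locally. The fetch_all_rows callable is hypothetical and stands in for whatever single request returns all (surface form, MIME type) rows; build_lookup and normalize_many are likewise names invented here.

```python
def build_lookup(fetch_all_rows):
    """Call fetch_all_rows() exactly once and index the rows in memory.

    fetch_all_rows is a hypothetical callable returning an iterable of
    (surface_form, mime_type) pairs for the whole worksheet.
    """
    lookup = {}
    for surface_form, mime_type in fetch_all_rows():
        lookup[surface_form.strip().lower()] = mime_type
    return lookup


def normalize_many(values, lookup):
    """Normalize all values locally, without any further remote calls.

    Returns a dict of value -> MIME type for known values, plus the set
    of unknown values so they can be reported or uploaded in one batch.
    """
    normalized = {}
    unknown = set()
    for value in values:
        key = value.strip().lower()
        if key in lookup:
            normalized[value] = lookup[key]
        else:
            unknown.add(value)
    return normalized, unknown
```

Collecting the unknown surface forms into a set means they could be written back to the spreadsheet in a single second call, instead of one request per unmatched value.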

comment:3 follow-up: ↓ 4 Changed 23 months ago by icmurray

Owner changed from rgrp to tobes
Status changed from new to assigned
Milestone set to ckan-v1.9

Moved to ckan-future and unassigned so that this ticket will be picked up in triage.

comment:4 in reply to: ↑ 3 Changed 23 months ago by icmurray

Replying to icmurray:

Moved to ckan-future and unassigned so that this ticket will be picked up in triage.

Sorry, this comment bears no resemblance to what I actually did! Assigned to tobes for 1.9.
