Ticket #235 (assigned enhancement)

Opened 4 years ago

Last modified 23 months ago

Resource format normalization and detection

Reported by: dread Owned by: tobes
Priority: awaiting triage Milestone: ckan-v1.9
Component: ckan Keywords:
Cc: Repository: ckan
Theme: none

Description (last modified by pudo)

Try to gather proper MIME information for all package resources in CKAN. This is a shared ticket with dcat-tools (https://bitbucket.org/pudo/dcat-tools), i.e. opendatasearch.org. This can then also be used by ckanrdf, the CKAN RDF conversion service.

Sub-tasks:

  • Create a Google Spreadsheet with two worksheets: "MIME-Mappings", e.g. "CSV" -> "text/csv", and "Name mappings", e.g. "text/csv" -> "Comma-Separated Spreadsheet".
  • Collect and map surface forms from all CKANs
  • Access this via Swiss and apply it; store the result as a PackageResource extra field, pending #826 (Resource extras).
  • Add heuristics for format auto-detections:
    • Map well-known file extensions
    • Recognize obvious magic (Zip, Tar)
    • Peek into Zipfile/Tarfiles?
  • Define a convention for generic data types (many CKAN packages have only "Spreadsheet" defined, either detect specific type or set MIME to */tabular-data or similar)
  • See also: #816 (Autocomplete for the resource format field)
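The normalization and detection sub-tasks above could be sketched roughly as follows. This is only an illustration, not the datautil or dcat-tools code: the function names and the two-entry MIME_MAPPINGS table are hypothetical stand-ins for the "MIME-Mappings" worksheet, and the precedence (declared format, then magic bytes, then file extension) is an assumption.

```python
import mimetypes
from typing import Optional

# Hypothetical excerpt of the "MIME-Mappings" worksheet:
# lower-cased surface form -> MIME type.
MIME_MAPPINGS = {
    "csv": "text/csv",
    "zip": "application/zip",
}


def normalize_format(surface_form: str) -> Optional[str]:
    """Map a free-text format such as ' CSV ' or '.csv' to a MIME type."""
    key = surface_form.strip().lower().lstrip(".")
    return MIME_MAPPINGS.get(key)


def guess_from_extension(url: str) -> Optional[str]:
    """Map a well-known file extension using the standard library table."""
    mime, _encoding = mimetypes.guess_type(url)
    return mime


def sniff_magic(head: bytes) -> Optional[str]:
    """Recognize obvious magic numbers in the first bytes of a file."""
    if head[:4] == b"PK\x03\x04":      # Zip local file header
        return "application/zip"
    if head[257:262] == b"ustar":      # POSIX tar magic at offset 257
        return "application/x-tar"
    return None


def detect_format(declared: Optional[str], url: str, head: bytes = b"") -> Optional[str]:
    """Try the declared format, then magic bytes, then the file extension."""
    return (
        normalize_format(declared or "")
        or sniff_magic(head)
        or guess_from_extension(url)
    )
```

A generic value such as "Spreadsheet" has no entry in the table, so normalize_format returns None for it and the other heuristics get a chance; the convention for such generic types is still the open question from the list above.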

Change History

comment:1 Changed 3 years ago by pudo

  • Description modified
  • Summary changed from Resource format detect from filename extension to Resource format normalization and detection

comment:2 Changed 3 years ago by pudo

  • Repository set to ckan
  • Theme set to none

The first three sub-items of this ticket are done in datautil and dcat-tools:

Basic GDocs-based normalizer:

Example of use:

Spreadsheet (as referenced in datautil source, should be a kwarg):

Experience so far has been that Google rate-limits the current implementation, so we should perform all operations in one or two big calls rather than piece by piece.
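One way to keep the number of calls down would be to fetch the whole mapping worksheet once and resolve every format locally. The sketch below assumes a caller-supplied fetch_rows callable that returns all (surface form, MIME type) rows in a single request; that callable and the function names are hypothetical and do not reflect the datautil API.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, str]  # (surface form, MIME type)


def build_lookup(rows: Iterable[Row]) -> Dict[str, str]:
    """Index the worksheet rows by lower-cased surface form."""
    return {surface.strip().lower(): mime for surface, mime in rows}


def normalize_all(
    formats: Iterable[str],
    fetch_rows: Callable[[], Iterable[Row]],
) -> List[Optional[str]]:
    """Normalize many formats with exactly one call to fetch_rows."""
    lookup = build_lookup(fetch_rows())  # the only remote call
    return [lookup.get(f.strip().lower()) for f in formats]
```

Formats that are missing from the worksheet come back as None, so they can be collected and added to the spreadsheet in a single follow-up call.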

comment:3 follow-up: ↓ 4 Changed 23 months ago by icmurray

  • Owner changed from rgrp to tobes
  • Status changed from new to assigned
  • Milestone set to ckan-v1.9

Moved to ckan-future and unassigned so that this ticket will be picked up in triage.

comment:4 in reply to: ↑ 3 Changed 23 months ago by icmurray

Replying to icmurray:

Moved to ckan-future and unassigned so that this ticket will be picked up in triage.

Sorry, this comment bears no resemblance to what I actually did! Assigned to tobes for 1.9.
