Ticket #235 (new enhancement) — at Version 1

Opened 4 years ago

Last modified 23 months ago

Resource format normalization and detection

Reported by: dread Owned by: rgrp
Priority: awaiting triage Milestone: ckan-v1.9
Component: ckan Keywords:
Cc: Repository: ckan
Theme: none

Description (last modified by pudo)

Gather proper MIME type information for all package resources in CKAN. This ticket is shared with dcat-tools (https://bitbucket.org/pudo/dcat-tools), i.e. opendatasearch.org. The resulting MIME information can then also be used by ckanrdf, the CKAN RDF conversion service.

Sub-tasks:

  • Create a Google Spreadsheet with two worksheets: "MIME-Mappings", e.g. "CSV" -> "text/csv", and "Name mappings", e.g. "text/csv" -> "Comma-Separated Spreadsheet".
  • Collect the surface forms of format values from all CKAN instances and map them.
  • Access the spreadsheet via Swiss and apply the mappings; store the result as a PackageResource extra field, pending #826 (Resource extras).
  • Add heuristics for format auto-detection:
    • Map well-known file extensions
    • Recognize obvious magic (Zip, Tar)
    • Peek into Zip and Tar archives?
  • Define a convention for generic data types (many CKAN packages only have "Spreadsheet" defined; either detect the specific type or set the MIME type to */tabular-data or similar).
  • See also: #816 (Autocomplete for the resource format field)
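
A minimal sketch of how the normalization and detection sub-tasks above might fit together, in Python. Everything here is an assumption made for illustration: the dictionaries stand in for the two spreadsheet worksheets, the function names are hypothetical, and the lookup order (declared format, then file extension, then magic bytes, then a generic fallback) is just one possible choice. Only the "CSV" -> "text/csv" and "text/csv" -> "Comma-Separated Spreadsheet" entries and the */tabular-data placeholder come from this ticket.

```python
import os
from urllib.parse import urlparse

# Stand-in for the "MIME-Mappings" worksheet: surface form -> MIME type.
# Entries other than "csv" are illustrative assumptions.
MIME_MAPPINGS = {
    "csv": "text/csv",
    "xls": "application/vnd.ms-excel",
    "zip": "application/zip",
    "json": "application/json",
}

# Stand-in for the "Name mappings" worksheet: MIME type -> display name.
NAME_MAPPINGS = {
    "text/csv": "Comma-Separated Spreadsheet",
}

# Well-known file extensions (illustrative subset).
EXTENSION_MAPPINGS = {
    ".csv": "text/csv",
    ".xls": "application/vnd.ms-excel",
    ".zip": "application/zip",
    ".tar": "application/x-tar",
    ".json": "application/json",
}

# Generic types that name no concrete format; the placeholder MIME type
# follows the "*/tabular-data or similar" suggestion in this ticket.
GENERIC_FALLBACKS = {
    "spreadsheet": "*/tabular-data",
}


def normalize_format(surface_form):
    """Map a free-text format value such as ' CSV ' to a MIME type, or None."""
    key = (surface_form or "").strip().lower().lstrip(".")
    return MIME_MAPPINGS.get(key)


def mime_from_url(url):
    """Guess a MIME type from the file extension in the URL path, or None."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return EXTENSION_MAPPINGS.get(ext)


def mime_from_magic(head):
    """Recognize Zip and Tar from the first bytes of a resource, or None."""
    if head.startswith(b"PK\x03\x04"):  # Zip local file header signature
        return "application/zip"
    if head[257:262] == b"ustar":  # POSIX/GNU tar magic at offset 257
        return "application/x-tar"
    return None


def detect(declared_format, url, head=b""):
    """Try each heuristic in turn; fall back to a generic type if known."""
    specific = (
        normalize_format(declared_format)
        or mime_from_url(url)
        or mime_from_magic(head)
    )
    if specific:
        return specific
    generic_key = (declared_format or "").strip().lower()
    return GENERIC_FALLBACKS.get(generic_key)


def display_name(mime_type):
    """Look up the human-readable name for a MIME type, or None."""
    return NAME_MAPPINGS.get(mime_type)
```

For example, detect("Spreadsheet", "http://example.org/data.xls") would resolve to the specific Excel type via the extension, while the same declared format with an extensionless URL and no content sample would fall back to */tabular-data. Peeking inside archives is left out of the sketch.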

Change History

comment:1 Changed 3 years ago by pudo

  • Description modified
  • Summary changed from Resource format detect from filename extension to Resource format normalization and detection