Ticket #235 (assigned enhancement)
Resource format normalization and detection
Reported by: | dread | Owned by: | tobes |
---|---|---|---|
Priority: | awaiting triage | Milestone: | ckan-v1.9 |
Component: | ckan | Keywords: | |
Cc: | | Repository: | ckan |
Theme: | none | | |
Description (last modified by pudo)
Try to gather proper MIME type information for all package resources in CKAN. This ticket is shared with dcat-tools (https://bitbucket.org/pudo/dcat-tools), i.e. opendatasearch.org. The result can then also be used by ckanrdf, the CKAN RDF conversion service.
Sub-tasks:
- Create a Google Spreadsheet with two worksheets: "MIME-Mappings", e.g. "CSV" -> "text/csv", and "Name mappings", e.g. "text/csv" -> "Comma-Separated Spreadsheet".
- Collect and map surface forms from all CKAN instances
- Access this via Swiss and apply it; store the result as a PackageResource extra field, pending #826 (Resource extras).
- Add heuristics for format auto-detection:
  - Map well-known file extensions
  - Recognize obvious magic (Zip, Tar)
  - Peek into Zipfile/Tarfiles?
- Define a convention for generic data types (many CKAN packages have only "Spreadsheet" defined; either detect the specific type or set the MIME type to */tabular-data or similar)
- See also: #816 (Autocomplete for the resource format field)
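The sub-tasks above could be sketched roughly as follows. This is only an illustration, not the datautil or dcat-tools code: the mapping tables are hypothetical stand-ins for the "MIME-Mappings" worksheet, all function names are made up for this example, and the lookup order (declared format, then magic bytes, then file extension, then generic type) is an assumption.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical stand-in for the "MIME-Mappings" worksheet.
MIME_MAPPINGS = {
    "csv": "text/csv",
    "xls": "application/vnd.ms-excel",
    "zip": "application/zip",
    "tar": "application/x-tar",
}

# Hypothetical convention for generic types, as proposed in the ticket.
GENERIC_MAPPINGS = {"spreadsheet": "*/tabular-data"}

ZIP_MAGIC = b"PK\x03\x04"   # local file header at the start of a Zip file
TAR_MAGIC = b"ustar"        # POSIX tar marker
TAR_MAGIC_OFFSET = 257      # the marker sits at byte 257 of the header


def _clean(surface_form):
    """Reduce a free-text format such as ' .CSV ' to a lookup key."""
    return surface_form.strip().lower().lstrip(".")


def normalize_format(surface_form):
    """Map a declared format string to a MIME type, or None if unknown."""
    return MIME_MAPPINGS.get(_clean(surface_form))


def guess_from_extension(url):
    """Guess a MIME type from the file extension in the resource URL."""
    path = urlparse(url).path  # drops the query string and fragment
    _root, ext = posixpath.splitext(path)
    return MIME_MAPPINGS.get(_clean(ext))


def sniff_magic(head):
    """Recognize Zip and Tar from the first bytes of the file."""
    if head.startswith(ZIP_MAGIC):
        return "application/zip"
    end = TAR_MAGIC_OFFSET + len(TAR_MAGIC)
    if head[TAR_MAGIC_OFFSET:end] == TAR_MAGIC:
        return "application/x-tar"
    return None


def detect_format(declared, url, head=b""):
    """Prefer a specific type; fall back to a generic one, then None."""
    key = _clean(declared)
    return (
        MIME_MAPPINGS.get(key)
        or sniff_magic(head)
        or guess_from_extension(url)
        or GENERIC_MAPPINGS.get(key)
    )
```

With these tables, a resource declared only as "Spreadsheet" resolves to text/csv when its URL ends in .csv, and to */tabular-data when nothing more specific can be detected.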
Change History
comment:1 Changed 3 years ago by pudo
- Description modified
- Summary changed from "Resource format detect from filename extension" to "Resource format normalization and detection"
comment:2 Changed 3 years ago by pudo
- Repository set to ckan
- Theme set to none
The first three sub-items of this ticket are done in datautil and dcat-tools:
Basic GDocs-based normalizer:
Example of use:
Spreadsheet (as referenced in datautil source, should be a kwarg):
Experience so far has been that Google rate-limits the current implementation, so we should perform all operations in one or two big calls rather than piece by piece.
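One way to avoid the piece-by-piece requests is to fetch the whole mapping worksheet once and answer every lookup from a local table. The sketch below assumes a hypothetical fetch_all_rows callable that returns every (surface form, MIME type) row in a single request; it does not use any real Google API, and the class name is made up for this example.

```python
class CachedNormalizer:
    """Load the full mapping table in one call, then look up locally."""

    def __init__(self, fetch_all_rows):
        # fetch_all_rows is a hypothetical callable that returns all
        # (surface_form, mime_type) rows of the worksheet in one request.
        self._fetch_all_rows = fetch_all_rows
        self._table = None

    def _load(self):
        # Only the first lookup triggers a remote call.
        if self._table is None:
            self._table = {
                surface.strip().lower(): mime
                for surface, mime in self._fetch_all_rows()
            }
        return self._table

    def normalize(self, surface_form):
        """Return the MIME type for a surface form, or None if unmapped."""
        return self._load().get(surface_form.strip().lower())


# Usage with a fake fetcher that counts how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return [("CSV", "text/csv"), ("XLS", "application/vnd.ms-excel")]


normalizer = CachedNormalizer(fake_fetch)
results = [normalizer.normalize(f) for f in ["csv", " CSV", "xls", "pdf"]]
```

However many resources are normalized, the fetcher runs once, so the number of remote calls no longer grows with the number of packages.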