Distributed Data and Syncing Between CKAN Instances
Aim: to support pulling and pushing metadata between different CKAN registries while preserving history.
This problem has strong similarities to distributed version control and distributed databases.
Research
- Distributed version control systems:
- Mercurial -- for an overview Inside a Distributed VCS
- Distributed databases - NB: we are looking to preserve history (not just allow multiple distributed writes)
Use cases
1. data.gov.uk and ckan.net
Want to make data on data.gov.uk (hmg.ckan.net) available in a public CKAN instance. We will therefore end up with:
- Package on data.gov.uk
- Package on ckan.net
Need to keep these two representations of the package in "sync".
Remarks: This is easy if we only edit in one place. But what if we want to let community edit on ckan.net? Two options:
- have 2 copies on ckan.net one community owned and one locked down
- Pro: easy to keep stuff separate
- Con: terrible user experience and still have issue that two items can diverge
- Have one copy that is world editable into which gets *merged back* into the official data every so often
Detailed Problem Description
Let us consider different CKAN registries: A,B,C
Consider the following way we may want information to travel::
A ------- \ C A --> B--/
In words:
- Changes go directly from A to C
- Changes go directly from A to B. Then changes from B are pulled to C
ra1 ra2 A -----------> A' \ \ \ \ B---> B' ---> B'' (need to merge r2 and r3) ra1 rb1
Specification v1
Scenarios
- 1-way: On setup, Server A's packages are copied to Server B. On sync, changes to packages on Server A are transferred to Server B.
- 2-way: On setup, packages from each server is copied to the other. On sync, changes on each are transferred to the other.
We will focus now on 1-way, leaving 2-way for the future.
Requirements
- Merging of changes from both machines. If there is a conflict then it is logged and a result is chosen.
- Use of Server A and Server B continues undisturbed during sync.
Issues
- Clashes of package/tag/group names.
- Sync between CKAN instances of slightly different versions of ckan & vdm.
- Unversioned objects - make versioned? User, Group, Authz, Rating.
- How to test system.
- Copy authorization tables? Allow API access to objects not-authorized for reading by visitor?
Use cases
- First sync - all packages and revisions are copied from Server A to Server B.
- Subsequent sync after package changes on A and/or B.
- Sync after package purged on A. (Package also purged on B.)
- Sync after package purged on B. (Package not recreated on B.)
- Server B syncs at different times from a third server.
- Package/Tag/Group? name on Server A clashes with an existing one on Server B. Log all of them. Merge tag and group. Not sure about package.
- Objects on Server A with restricted authz are by default editable on Server B.
Operation
First sync - 10am
Server B asks "Give me all your revisions and unrevisioned objects."
Server A replies "Rev1 and associated revisions Pkg1Rev1, Pkg2Rev1, !PkgTagRev1, !PkgResource1; Rev2 with Pkg1Rev2; Tag1; User1; !PackageGroup1; Group1; ratings"
Server B creates Rev1, Pkg1Rev1, Pkg2Rev1, Pkg1Rev2, Pkg1, Pkg2, !PkgTagRev1, !PkgResource1, Tag1, User1, Auth, !PackageGroup1, Group1. UUIDs are the same as Server A.
Server B updates search vectors for Pkg1 and Pkg2.
Server records the time of the sync - 10am.
Meanwhile - 10.20am
On Server A, user updates Pkg1 twice, creating Rev2/Pkg1Rev3 and Rev3/Pkg1Rev4PkgTagRev2. User1 updates his name.
Meanwhile - 10.40am
On Server B, user updates Pkg1 once, creating Pkg1Rev5.
Sesequent sync
Server B asks "Give me revisions and diffs since 10am."
Server A replies Rev2/Pkg1Rev3, Rev3/Pkg1Rev4 and gives diff of Pkg1Rev2 -> Pkg1Rev4
Server B looks at its own revisions since 10am and sees Pkg1 now has two heads. It calculates diff of Pkg1Rev2 -> Pkg1Rev5.
Server B takes Pkg1Rev2 and applies the two diffs in the order of priority, logging any conflicts, calling the result Pkg1Rev6.
Tickets
- Sync set-up stored in config file (server URI). Last sync status stored in local db.
- Repository method 'all_revs_since'. It returns all revisions since a time/revision (or since the beginning).
- Object method 'diff'. It returns a Diff object which is the diff of two ObjectResources. Already exists for Package, but need for PackageTag, PackageExtra
- Revision method 'serialize'.
- Diff method 'serialize'.
- API access to revisions: /api/search/revision?since=ab49f348-fd23-ae3c
- API access to diffs: /api/diff/revision?diff=8f77c992-5eec-4909&oldid=ab49f348-fd23-ae3c
- API access to unrevisioned objects?
- API access to version of CKAN / vdm.
- Repository method 'import_revisions'. It takes serialised revisions and diffs and creates revision objects exactly matching spec.
- Object method 'merge_diffs'. It takes an original object and two diffs that apply to it and applies them both in a new revision.