= Distributed Data and Syncing Between CKAN Instances =

Aim: '''to support pulling and pushing metadata between different CKAN registries while ''preserving'' history.'''

This problem has strong similarities to distributed version control and distributed databases.

= Research =

 * Distributed version control systems:
   * Mercurial -- for an overview see [http://mercurial.selenic.com/wiki/Mercurial?action=AttachFile&do=get&target=Hague2009.pdf Inside a Distributed VCS]
 * Distributed databases - NB: we are looking to preserve history (not just allow multiple distributed writes)

= Use cases =

== 1. data.gov.uk and ckan.net ==

We want to make data on data.gov.uk (hmg.ckan.net) available in a public CKAN instance. We will therefore end up with:

 1. Package on data.gov.uk
 2. Package on ckan.net

We need to keep these two representations of the package in "sync".

Remarks: This is easy if we only edit in one place. But what if we want to let the community edit on ckan.net? Two options:

 1. Have two copies on ckan.net, one community-owned and one locked down
   * Pro: easy to keep stuff separate
   * Con: terrible user experience, and the two items can still diverge
 2. Have one copy that is world-editable, which gets *merged back* into the official data every so often

= Detailed Problem Description =

Let us consider different CKAN registries: A, B, C

Consider the following ways we may want information to travel:

{{{
A -------
         \
          C
A --> B--/
}}}

In words:

 1. Changes go directly from A to C
 2. Changes go directly from A to B. Then changes from B are pulled to C

{{{
   ra1          ra2
A -----------> A'
 \              \
  \              \
   B---> B' ---> B''  (need to merge r2 and r3)
    ra1     rb1
}}}

= Specification v1 =

== Scenarios ==

 * 1-way: On setup, Server A's packages are copied to Server B. On sync, changes to packages on Server A are transferred to Server B.
 * 2-way: On setup, packages from each server are copied to the other. On sync, changes on each are transferred to the other.

We will focus now on 1-way, leaving 2-way for the future.

== Requirements ==

 * Merging of changes from both machines. If there is a conflict then it is logged and a result is chosen.
 * Use of Server A and Server B continues undisturbed during sync.

== Issues ==

 * Clashes of package/tag/group names.
 * Sync between CKAN instances running slightly different versions of ckan & vdm.
 * Unversioned objects - make versioned? User, Group, Authz, Rating.
 * How to test the system.
 * Copy authorization tables? Allow API access to objects not authorized for reading by the visitor?

== Use cases ==

 * First sync - all packages and revisions are copied from Server A to Server B.
 * Subsequent sync after package changes on A and/or B.
 * Sync after package purged on A. (Package also purged on B.)
 * Sync after package purged on B. (Package not recreated on B.)
 * Server B syncs at different times from a third server.
 * Package/Tag/Group name on Server A clashes with an existing one on Server B. Log all of them. Merge tag and group. Not sure about package.
 * Objects on Server A with restricted authz are by default editable on Server B.

== Operation ==

=== First sync - 10am ===

Server B asks "Give me all your revisions and unrevisioned objects."

Server A replies "Rev1 and associated revisions Pkg1Rev1, Pkg2Rev1, !PkgTagRev1, !PkgResource1; Rev2 with Pkg1Rev2; Tag1; User1; !PackageGroup1; Group1; ratings"

Server B creates Rev1, Pkg1Rev1, Pkg2Rev1, Pkg1Rev2, Pkg1, Pkg2, !PkgTagRev1, !PkgResource1, Tag1, User1, Auth, !PackageGroup1, Group1. UUIDs are the same as on Server A.

Server B updates search vectors for Pkg1 and Pkg2.

Server records the time of the sync - 10am.

=== Meanwhile - 10.20am ===

On Server A, a user updates Pkg1 twice, creating Rev2/Pkg1Rev3 and Rev3/Pkg1Rev4PkgTagRev2. User1 updates his name.

=== Meanwhile - 10.40am ===

On Server B, a user updates Pkg1 once, creating Pkg1Rev5.

=== Subsequent sync ===

Server B asks "Give me revisions and diffs since 10am."
Server A replies with Rev2/Pkg1Rev3 and Rev3/Pkg1Rev4, and gives the diff of Pkg1Rev2 -> Pkg1Rev4.

Server B looks at its own revisions since 10am and sees that Pkg1 now has two heads. It calculates the diff of Pkg1Rev2 -> Pkg1Rev5.

Server B takes Pkg1Rev2 and applies the two diffs in order of priority, logging any conflicts, and calls the result Pkg1Rev6.

== Tickets ==

 * Sync set-up stored in config file (server URI). Last sync status stored in local db.
 * Repository method 'all_revs_since'. It returns all revisions since a time/revision (or since the beginning).
 * Object method 'diff'. It returns a Diff object which is the diff of two !ObjectResources. Already exists for Package, but is also needed for !PackageTag and !PackageExtra.
 * Revision method 'serialize'.
 * Diff method 'serialize'.
 * API access to revisions: /api/search/revision?since=ab49f348-fd23-ae3c
 * API access to diffs: /api/diff/revision?diff=8f77c992-5eec-4909&oldid=ab49f348-fd23-ae3c
 * API access to unrevisioned objects?
 * API access to version of CKAN / vdm.
 * Repository method 'import_revisions'. It takes serialised revisions and diffs and creates revision objects exactly matching spec.
 * Object method 'merge_diffs'. It takes an original object and two diffs that apply to it and applies them both in a new revision.
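
The 'merge_diffs' step from the subsequent sync could look roughly like the sketch below. It is only an illustration under stated assumptions, not the ckan/vdm implementation: packages are modelled as plain dicts, a diff is a hypothetical mapping of field name to (old, new) values, Server A's diff is assumed to have priority, and the field values are made up.

```python
import logging

log = logging.getLogger("sync")  # hypothetical logger name


def diff(old, new):
    """Return {field: (old_value, new_value)} for each field that differs.

    Assumption: None is never a real field value, so it marks an absent field.
    """
    fields = set(old) | set(new)
    return {
        f: (old.get(f), new.get(f))
        for f in fields
        if old.get(f) != new.get(f)
    }


def apply_diff(obj, changes):
    """Apply a diff in place; a new value of None removes the field."""
    for field, (_old, new_value) in changes.items():
        if new_value is None:
            obj.pop(field, None)
        else:
            obj[field] = new_value


def merge_diffs(original, priority_diff, other_diff):
    """Apply both diffs to a copy of `original`.

    The lower-priority diff is applied first, so the priority diff overwrites
    it wherever both change the same field to different values.
    Returns (merged_object, sorted list of conflicting field names).
    """
    conflicts = sorted(
        f for f in priority_diff
        if f in other_diff and priority_diff[f][1] != other_diff[f][1]
    )
    for f in conflicts:
        log.warning("conflict on %r: kept %r, dropped %r",
                    f, priority_diff[f][1], other_diff[f][1])
    merged = dict(original)
    apply_diff(merged, other_diff)
    apply_diff(merged, priority_diff)
    return merged, conflicts


# Common ancestor and the two heads from the walkthrough (values illustrative).
pkg1_rev2 = {"title": "Roads", "license": "OKD", "notes": ""}
pkg1_rev4 = {"title": "UK Roads", "license": "OKD", "notes": "From Server A"}
pkg1_rev5 = {"title": "Road data", "license": "CC-BY", "notes": ""}

diff_a = diff(pkg1_rev2, pkg1_rev4)  # sent by Server A
diff_b = diff(pkg1_rev2, pkg1_rev5)  # computed locally on Server B
pkg1_rev6, conflicts = merge_diffs(pkg1_rev2, diff_a, diff_b)
```

Here only 'title' is changed on both servers, so it is the only field logged as a conflict; the non-overlapping changes to 'license' (from B) and 'notes' (from A) both survive in Pkg1Rev6. Which server should win a conflict is a policy choice that the spec leaves open.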