wiki:SyncingInstances
Last modified 4 years ago Last modified on 02/18/10 18:16:16

Distributed Data and Syncing Between CKAN Instances

Aim: to support pulling and pushing metadata between different CKAN registries while preserving history.

This problem has strong similarities to distributed version control and distributed databases.

Research

  • Distributed version control systems:
  • Distributed databases - NB: we are looking to preserve history (not just allow multiple distributed writes)

Use cases

1. data.gov.uk and ckan.net

Want to make data on data.gov.uk (hmg.ckan.net) available in a public CKAN instance. We will therefore end up with:

  1. Package on data.gov.uk
  2. Package on ckan.net

Need to keep these two representations of the package in "sync".

Remarks: This is easy if we only edit in one place. But what if we want to let community edit on ckan.net? Two options:

  1. have 2 copies on ckan.net one community owned and one locked down
    • Pro: easy to keep stuff separate
    • Con: terrible user experience and still have issue that two items can diverge
  2. Have one copy that is world editable into which gets *merged back* into the official data every so often

Detailed Problem Description

Let us consider different CKAN registries: A,B,C

Consider the following way we may want information to travel::

  A -------
           \
            C
  A --> B--/

In words:

  1. Changes go directly from A to C
  2. Changes go directly from A to B. Then changes from B are pulled to C
 ra1    ra2
  A -----------> A'
    \             \
     \             \
      B---> B' ---> B'' (need to merge r2 and r3)
      ra1   rb1       

Specification v1

Scenarios

  • 1-way: On setup, Server A's packages are copied to Server B. On sync, changes to packages on Server A are transferred to Server B.
  • 2-way: On setup, packages from each server is copied to the other. On sync, changes on each are transferred to the other.

We will focus now on 1-way, leaving 2-way for the future.

Requirements

  • Merging of changes from both machines. If there is a conflict then it is logged and a result is chosen.
  • Use of Server A and Server B continues undisturbed during sync.

Issues

  • Clashes of package/tag/group names.
  • Sync between CKAN instances of slightly different versions of ckan & vdm.
  • Unversioned objects - make versioned? User, Group, Authz, Rating.
  • How to test system.
  • Copy authorization tables? Allow API access to objects not-authorized for reading by visitor?

Use cases

  • First sync - all packages and revisions are copied from Server A to Server B.
  • Subsequent sync after package changes on A and/or B.
  • Sync after package purged on A. (Package also purged on B.)
  • Sync after package purged on B. (Package not recreated on B.)
  • Server B syncs at different times from a third server.
  • Package/Tag/Group? name on Server A clashes with an existing one on Server B. Log all of them. Merge tag and group. Not sure about package.
  • Objects on Server A with restricted authz are by default editable on Server B.

Operation

First sync - 10am

Server B asks "Give me all your revisions and unrevisioned objects."

Server A replies "Rev1 and associated revisions Pkg1Rev1, Pkg2Rev1, !PkgTagRev1, !PkgResource1; Rev2 with Pkg1Rev2; Tag1; User1; !PackageGroup1; Group1; ratings"

Server B creates Rev1, Pkg1Rev1, Pkg2Rev1, Pkg1Rev2, Pkg1, Pkg2, !PkgTagRev1, !PkgResource1, Tag1, User1, Auth, !PackageGroup1, Group1. UUIDs are the same as Server A.

Server B updates search vectors for Pkg1 and Pkg2.

Server records the time of the sync - 10am.

Meanwhile - 10.20am

On Server A, user updates Pkg1 twice, creating Rev2/Pkg1Rev3 and Rev3/Pkg1Rev4PkgTagRev2. User1 updates his name.

Meanwhile - 10.40am

On Server B, user updates Pkg1 once, creating Pkg1Rev5.

Sesequent sync

Server B asks "Give me revisions and diffs since 10am."

Server A replies Rev2/Pkg1Rev3, Rev3/Pkg1Rev4 and gives diff of Pkg1Rev2 -> Pkg1Rev4

Server B looks at its own revisions since 10am and sees Pkg1 now has two heads. It calculates diff of Pkg1Rev2 -> Pkg1Rev5.

Server B takes Pkg1Rev2 and applies the two diffs in the order of priority, logging any conflicts, calling the result Pkg1Rev6.

Tickets

  • Sync set-up stored in config file (server URI). Last sync status stored in local db.
  • Repository method 'all_revs_since'. It returns all revisions since a time/revision (or since the beginning).
  • Object method 'diff'. It returns a Diff object which is the diff of two ObjectResources. Already exists for Package, but need for PackageTag, PackageExtra
  • Revision method 'serialize'.
  • Diff method 'serialize'.
  • API access to revisions: /api/search/revision?since=ab49f348-fd23-ae3c
  • API access to diffs: /api/diff/revision?diff=8f77c992-5eec-4909&oldid=ab49f348-fd23-ae3c
  • API access to unrevisioned objects?
  • API access to version of CKAN / vdm.
  • Repository method 'import_revisions'. It takes serialised revisions and diffs and creates revision objects exactly matching spec.
  • Object method 'merge_diffs'. It takes an original object and two diffs that apply to it and applies them both in a new revision.