Ticket #1129 (new CREP) — at Version 5

Opened 3 years ago

Last modified 2 years ago

CREP0002: Moderated Edits

Reported by: kindly Owned by:
Priority: major Milestone: ckan-v1.5
Component: ckan Keywords:
Cc: rufus.pollock@… Repository: ckan
Theme: none

Description (last modified by kindly) (diff)

Proposer: David Raznick

Abstract.

We are trying to achieve these goals.

  • To get people involved with making edits to CKAN metadata.
  • To have an ownership model as to who can moderate and validate these changes
  • To not put too huge a burden on these owners.

In order to achieve this, a feature which lets anyone edit a package but only let the moderator/owner accept it. The moderator should be able to look at a list of changes and accept the ones that

This cep is not about 'if' we need such a feature, it is about 'how' we go about implementing it. Another cep may needed for the 'if' case.

The Problem

We need the following to be possible.

  • Storing revision of objects that are not the current active one.
  • A way of the user viewing past revisions.
  • Accessing not only the history of a particular object but also of related objects at that time. i.e If a resource related to a package changes we need a way to see this when looking at the package.
  • A robust way of doing this in the face of database schema changes.
  • Make sure database queries are quick.

Solutions.

  1. Store the whole dictization of the package and all its related objects every time you change anything in its dictized representation and only save to the database proper if accepted.

Pros

  • Easy to implement, we already have a preview which makes the dictized form of a package without actually saving it. This will just need to be persisted in some way.
  • Fast retrieval.
  • Potential to store a branching revision tree of changes.

Cons

  • No easy way to remake the dictized packages historically or if there is an there a change in the way we represent packages, i.e schema changes.
  • Will only work for the particular objects we decide to store these changes for.
  • Stores a lot of repeated information
  1. Write specialized queries for every read of the database looking only at the revision tables.

This method requires there to be a change in the way we use VDM, so that we manage statefulness ourselves. We will need to add other states such as 'waiting for approval'.

Pros

  • No specialized storage required
  • Only need to change queries when schema changes
  • Can be made to work easily for other objects

Cons

  • Slower query time on read, as even looking at the last active package will need to do a fairly complicated query.

Implementation details.

1.

A new table with columns id, user, package_id, timestamp, revision_id, parent_id, dictized_package. revision_id should be null unless it is actually persisted to the database. parent_id is the id that this package_dict was changed from.

We could store only the diffs of the dictized_package as long as we assure that everything inside the json is stably sorted, this will make getting the historical data out slower.

Getting out the history of the dictized packages is an intensive task, as it will require replaying the whole history of all the changes and creating the dict for each change. This re-caching will need to be redone for every change we make to dictized representation of a package.

2.

Every normal packages read needs to look at the revision table to see the last accepted change in the dictized representation of the package. We also need to way to get what the dictized representation of the package was like at any point of its revision history. This querying is non-trivial in sql.

Participants

David Raznick to do it.

Progress.

Decided to go with option 2. However we will change the revisioning system to be like the schema attached. This gets rid of difficult querying problems caused by querying the revision tables by adding an end date, meaning you can do range queries.

The better and more normalized version of a revisioning system is outlined https://docs.google.com/drawings/d/1Y7nMgVsrs081Pame2RdbZHlCAlV33ddTZ8VAsab1j-0/edit?hl=en_GB&authkey=CJfd8vsB. We will be a step closer to that, with this change, but we will keep the current vdm more or less, intact.

Change History

Changed 3 years ago by kindly

script run do test speed of various ways of doing a "last before this date" query

comment:1 Changed 3 years ago by thejimmyg

  • Cc rufus.pollock@… added
  • Priority changed from awaiting triage to major
  • state changed from draft to accepted
  • Milestone set to ckan-v1.5

Great proposal, here's my thinking:

  • The current VDM model (where we have things like a single package table which logs changes to a package_revision table) is very well tested and works well and we should only consider a radical change with good reason
  • Although we are making a move from *model objects* being the central components to the *logic layer functions* being the central components, there is no need yet to have the database structures we store (for revisioning or otherwise) start to look like the dictized representation. The logic layer itself is designed to do that mapping.
  • Since we do have this logic layer though, there is less need for "magic" SQLAlchemy objects to support dicts and lists (eg the stateful objects used to make a list of tags on a package behave like a list), a few simple (and easily debuggable) queries in the logic layer would work better.

In the longer term we may want to look at serialising more dictized-looking objects and that may lend itself to a more dictized changeset model or even a more dictized storage system (eg no-SQL) but for the time being we are not at that point.

I recommend the following:

  • We continue to use the current default branch of VDM (not the changeset branch)
  • We continue to treat the package table (and other non-revision tables) as the most recent revision (even though the *active* revision displayed in CKAN might well be in the revision tables because more recent changes haven't been moderated yet)
  • We slowly stop using stateful lists and dicts in CKAN because we have move control with explicit queries in the logic layer

Sound good?

comment:2 follow-ups: ↓ 3 ↓ 4 Changed 3 years ago by rgrp

It seems like this ticket is getting a bit confused with ticket:1076 (changes re vdm) :-)

Some general conclusions relevant to both:

  • General feeling is not to change to changeset model
    • I'm personally +0 on changeset model
    • I would note main benefit is simplicity and orthogonality of changesets to core system. I also think cost of change is small.
    • That said we can simplify current model as well.
  • There seems some misunderstanding: change to have logic layer has almost nothing to do with being able to remove main stateful stuff in vdm. To be able to remove most of stateful stuff in vdm requires us to make some other changes (re foreign keys from revision objects to continuity)
  • There are other simplifications we should make to vdm before embarking on this (e.g. move to SessionExtension? from MapperExtension?). This is easy as that work has been done in changeset branch and can be backported.

What I don't see in this CEP as yet is any discussion of the complexities around pending stuff (e.g. do we allow multiple pending, do new edits get shown current or latest pending). I'll try and comment more on this but a lot of this was in original vdm discussion last summer.

comment:3 in reply to: ↑ 2 Changed 3 years ago by rgrp

Replying to rgrp:

It seems like this ticket is getting a bit confused with ticket:1076 (changes re vdm) :-)

I of course ;0 meant ticket:1077

comment:4 in reply to: ↑ 2 Changed 3 years ago by kindly

  • There seems some misunderstanding: change to have logic layer has almost nothing to do with being able to remove main stateful stuff in vdm. To be able to remove most of stateful stuff in vdm requires us to make some other changes (re foreign keys from revision objects to continuity)

The logic layer does not automatically help out, however it makes our life easier if we want to handle state ourselves. For example take package tags, if we remove the stateful_m2m properties and just use normal sqlalchemy relations. We will still want statefullness (i.e active, pending, deleted) on the package_tags table. We should update those on the table ourselves in the logic layer.

  • There are other simplifications we should make to vdm before embarking on this (e.g. move to SessionExtension? from MapperExtension?). This is easy as that work has been done in changeset branch and can be backported.

I agree but event thought the MapperExtension? way is not great, it is very well field tested.

comment:5 Changed 3 years ago by kindly

  • Description modified (diff)

Changed 3 years ago by kindly

Changed 3 years ago by kindly

Note: See TracTickets for help on using tickets.