Introduction:

The aim of this document is to provide a high-level overview of the syncing protocol which is used to synchronise different instances of Mnemosyne (e.g. the desktop version, versions on mobile phones, webserver). However, the protocol is sufficiently general so that other SRS programs based on the SM2 algorithm can implement it, leading to increased interoperability.

Also, some of the design choices are explained here. For the nitty-gritty and the details needed to actually implement the protocol, see the documentation in the reference implementation of the client and the server.


Features:

-full bidirectional sync, i.e. after your last sync you can add new cards on your desktop and review existing cards on your phone. During the next sync, these changes will be merged seamlessly.
-sufficiently general, so that many different SM2-based programs can interoperate.
-support for multiple partners, i.e. you can sync your phone both with your desktop and with a webserver. (The only limitation is that there can be no cycles in the sync graph; see later.)
-can be implemented efficiently and without using much memory on a mobile device.


Some use cases and definitions:

The client is the one that initiates the sync, and the server is the other party. E.g. we could have a Windows Mobile client syncing to an application running on the desktop, or a laptop client syncing to that same desktop application. At the same time, that desktop application could also act as client when it syncs to a publicly accessible website. In order to make all of these scenarios possible, we need to have the concept of a partnership: a link between two different instances (where an instance is defined as being either client or server). Each instance can have multiple partnerships, each with a different other instance.

A server can be either single user or multiuser. Applications running on desktops or mobile devices are typically single user, whereas a public webserver is multiuser.



The history log as main driver for the sync process:

In order to avoid the calculation of many checksums at sync time, each instance should keep track of a history of all the events that happened since the last sync. This seems like a heavy burden, but most SRS systems already store the repetition history for statistical reasons. The only extra information to be stored is when e.g. cards were added, edited, deleted, ...

As already mentioned, the history log is the main driver of the syncing process, i.e. we send across each log event successively, if necessary supplemented by extra information. E.g., the 'new card' event would need to be supplemented by sending the actual card data across. The entire process can also easily be implemented in a streaming manner both on the client and the server side, such that it does not require a lot of memory.

Storing all of this in the history also makes it very efficient to handle multiple partnerships: for each partnership, we only need to store the index in the history corresponding to the last sync. This can be used to easily create the changeset for that particular partnership. (Other options, like e.g. setting a 'needs_sync' flag on data structures, would be less efficient, as that would need to be done once per partnership, leading to much redundancy in storage.)
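
As an illustration of this bookkeeping (the table and column names below are made up, not those of the reference schema), the changeset for a partnership could be generated straight from the history log:

  # Minimal sketch, not the reference schema: a 'log' table with an
  # auto-incrementing index, and a 'partnerships' table storing the last
  # synced index per partner.
  import sqlite3

  con = sqlite3.connect("deck.db")

  def changeset_for(partner_id):
      # Index of the last log entry this partner already has.
      row = con.execute(
          "select last_log_index from partnerships where partner_id=?",
          (partner_id,)).fetchone()
      last_index = row[0] if row else 0
      # Everything after that index is the changeset; it can be streamed
      # row by row instead of being loaded into memory all at once.
      return con.execute(
          "select _id, event_type, object_id, timestamp from log where _id>?",
          (last_index,))

  def mark_synced(partner_id, new_last_index):
      # After a successful sync, only this single index needs updating.
      con.execute("update partnerships set last_log_index=? where partner_id=?",
                  (new_last_index, partner_id))
      con.commit()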



Interoperability with other SRS systems:

Some SRS systems are based on two-sided cards only, others have a fact/card model for N-sided cards, yet others support card types which can incorporate different behaviour still. Also, the set of attributes describing a card can vary between different SRS systems. It would be possible to design a protocol in such a way that taking any path between any number of SRS systems would result in no information loss, but this would place a heavy burden on each SRS system, as it would require each SRS system to hold on to the information from the other SRS systems. Therefore, we went instead with a protocol with the following properties:

-moving data from SRS system A to SRS system B and back will never result in information loss.
-say you have three systems, where A and C support a feature that B does not. If you move data from A to B and then on to C, you will have information loss (e.g. an N-sided card being flattened out to N 2-sided cards, even though C has support for N-sided cards). However, this is not a big limitation from a practical point of view: the user just has to move their data from B back to A before syncing with C, as opposed to directly syncing B with C.

If the destination does not know about a card type that the source does know about, then the source sends along the question and the answer explicitly. The destination is then required to treat this card as review-only, i.e. no editing is allowed. When the data flows back to the original source, no information is lost in this way.

If both source and destination know about the card type, there is no need to send across question and answer, but both instances need to know the card's fact id and fact view id. Together with the fact data and their joint knowledge of the card type, this will allow both parties to generate the question and answer from the fact data. (Which capabilities both SRS systems have is determined during the login phase of the protocol.)

If one SRS system requires extra variables for its functioning that the other one does not know about, there is no need to send those across during sync or for these extra variables to be stored by the other party. The other party will not modify those anyway, and the sender still has access to these variables. If the sync happens between two instances of the same SRS system, and the SRS system requires extra variables, then of course that extra data needs to be sent across.
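
To illustrate the decision described in the previous paragraphs (the field and attribute names below are purely illustrative, not the actual wire format), the sender could decide per card what to include based on the capabilities exchanged at login:

  # Hypothetical sketch of the decision above; names are illustrative only.
  def payload_for_card(card, partner_card_types, partner_is_card_based):
      entry = {"card_id": card.id, "grade": card.grade,
               "interval": card.interval}
      if partner_is_card_based or card.card_type_id not in partner_card_types:
          # The partner cannot regenerate Q/A itself: send them explicitly
          # and mark the card as review-only on the other side.
          entry["question"] = card.question()
          entry["answer"] = card.answer()
          entry["review_only"] = True
      else:
          # The partner knows this card type: fact id, fact view id and the
          # fact data are enough for it to regenerate question and answer.
          entry["fact_id"] = card.fact.id
          entry["fact_view_id"] = card.fact_view.id
          entry["fact_data"] = card.fact.data
      return entry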



Identifying people and machines:

Be sure to distinguish the following concepts:

-the user chosen id and password
-an anonymous but unique machine generated id identifying the user, used to upload revision history anonymously to the science servers
-an anonymous but unique machine generated id identifying each machine
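
For illustration only (the variable names below are made up), the two machine-generated ids could simply be random UUIDs, kept separate from the user chosen credentials:

  # Sketch only: generating the anonymous machine-generated ids. The user
  # chosen id and password are separate from these and are never derived
  # from them.
  import uuid

  machine_id = uuid.uuid4().hex             # unique per machine/instance
  anonymous_user_id = uuid.uuid4().hex      # anonymous id used when uploading
                                            # revision history to the science
                                            # servers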



Media files:

Media files like pictures, sound files, etc. are stored in a special media directory, which obviously can be in an entirely different location on the client and on the server. During the sync process, only pathnames relative to that media directory are sent across. Subdirectories in the media directory are designated using the unix path convention, i.e. / instead of \.

Since media can be modified outside of the program, it makes no sense to use the history to track media. Rather, each instance should keep track of which media files are associated with a deck, and what the last known modification date or checksum was, in order to determine which media files need syncing. This timestamp or checksum is considered internal to the program, and is not sent across during sync, to allow for the fact that different programs can use different methods to track file updates. E.g., checksums could be more expensive than timestamps on a mobile device.
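
As a minimal sketch of one way to do this bookkeeping with modification times (the function and argument names are made up):

  # Sketch: detect added or modified media by comparing current modification
  # times against those stored at the previous check. Checksums could be used
  # instead if mtimes are unreliable; this bookkeeping stays local and is
  # never sent across.
  import os

  def media_needing_sync(mediadir, known_mtimes):
      needing_sync = []
      for root, _dirs, files in os.walk(mediadir):
          for name in files:
              full = os.path.join(root, name)
              # Relative path with unix separators, as sent across the wire.
              rel = os.path.relpath(full, mediadir).replace(os.sep, "/")
              mtime = os.path.getmtime(full)
              if known_mtimes.get(rel) != mtime:
                  needing_sync.append(rel)
                  known_mtimes[rel] = mtime
      return needing_sync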

If the client identifies itself as a card-based client, then the server will automatically convert LaTeX tags to images, without the client having to code anything.



Cycles in the syncing graph:

Cycles in the syncing graph are not allowed. E.g., A can sync with B, and B can sync with C, but then A and C are no longer allowed to sync directly, without going through B.

To detect this, each sync partner not only keeps a list of partnerships resulting from a direct sync, but also from an indirect sync. I.e., if A syncs with B and then B with C, afterwards A should figure in the partnerships of C (but with e.g. a dummy index).

Therefore, upon login, the list of all the partnerships is sent to the sync partner. It is then the server that will check for the existence of cycles.
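
Conceptually, the check amounts to something like the following sketch (the reference server may implement it differently; the partner lists include both direct and indirect partners):

  # Rough sketch of the kind of cycle check involved.
  def creates_cycle(client_id, client_partners, server_id, server_partners,
                    server_direct_partners):
      if client_id in server_direct_partners:
          # They have already synced directly before: no new edge is added.
          return False
      if client_id in server_partners or server_id in client_partners:
          # They already know each other indirectly: a direct link would
          # close a loop.
          return True
      # A common third partner would also close a loop.
      common = (set(client_partners) - {server_id}) & \
               (set(server_partners) - {client_id})
      return bool(common)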

Getting rid of this limitation would be possible, but costly in terms of extra storage and processing power. Consider e.g. the following scenario:

-A and B sync, and afterwards both make separate local changes
-B and C sync, and afterwards C makes local changes
-A and B sync, and afterwards A makes local changes

Upon a subsequent sync between A and C, the changes of A to be sent across to C are no longer contiguous in A's log. To be able to correctly determine which log entries should be synced, each log entry would need a UUID as opposed to a local id, but this would cost much more storage and processing time. Another option might be to introduce sync events in the log, recording which partner synced to which partner and at what point, but this would be rather tricky to implement.


Conflicts:

Upon detecting conflicts, the full database from either the client or the server is copied to the other side, the direction depending on the user's choice. After doing so, the partnership table of the database that was copied over is adapted such that client and server ids are reversed. There are routines to send the database across either as XML (for non-Mnemosyne clients which have not implemented a binary extension to the server) or as binary.
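
As an illustrative sketch of that last adaptation step (the table and column names are made up, not those of the actual database):

  # Sketch only: after the other side's database has been copied over, the
  # partnership entry that pointed at this instance must be made to point at
  # the other instance, so that the roles of client and server are reversed
  # in the copy.
  import sqlite3

  def reverse_partnership(db_path, my_machine_id, partner_machine_id):
      con = sqlite3.connect(db_path)
      con.execute("update partnerships set partner_id=? where partner_id=?",
                  (partner_machine_id, my_machine_id))
      con.commit()
      con.close()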

Note however that there is no way to undo logs that have already been prepared to be sent to the science server. This is somewhat unfortunate, but the sheer size of the science database should mitigate this effect.

Another side effect of this approach is that it is only possible to keep the local version after a conflict if the client keeps all the same information as the server. So e.g. clients which are only card-based will only be able to keep the remote server version after a conflict. The same is true for libmnemosyne-based clients which have chosen not to keep the entire revision history in order to save resources.

The initial plan was to do something faster by applying incremental changes to the database: erasing the parts of the logs that need to be discarded, and constructing a stream of log entries to undo the changes based on information from the other side (e.g. a DELETED_TAG entry would be converted to an ADDED_TAG entry). However, this is not foolproof in the case of multiple partners (for similar reasons as the cycle problem). Consider e.g. the following scenario:

-A, B, C are in sync
-A and B make conflicting changes
-B and C sync
-C makes non conflicting changes
-A and B sync, conflicts are detected and A wins

When B and C now sync, the part of C's log that needs to be undone is no longer at the end of its log, so there is no easy way for B to tell C which parts need to be discarded. Here too, a way around this would be UUIDs for each log entry or logging sync events, but that would be either very costly or complicated.



Reference implementation:

For the actual details of the syncing process itself, the code of the reference implementation and the comments in that code are the places to look:

-ui.py: interface class abstracting away how the client and the server interact with the user
-database.py: the interface of the database class. Also contains an example SQL schema for a minimal client which only supports cards, not facts.
-log_entry.py: describes all the types of events stored in the history log, together with the information that needs to be associated with it
-partner.py: code shared between client and server, mainly dealing with how to bundle log entries and media files
-text_format.py: interface for a class describing how the information should be sent across in a textual way. The only format implemented so far is an XML protocol, described in text_formats/xml_format.py
-binary_format.py: in some cases, it's much faster to be able to send across a binary copy of the full database, e.g. on the initial sync. This of course requires that the server is able to generate a binary database in the format that the client can read. Implemented so far is binary_formats/mnemosyne_format.py, for clients based on libmnemosyne.
-client.py: the syncing process as seen from the client side
-server.py: the syncing process as seen from the server side

It's also interesting to have a look at the testsuite in tests/test_sync.py, for inspiration on what corner cases to be prepared for :-)



Syncing speed:

After robustness, speed is the most important property of any implementation of the sync protocol. An efficient sync will contribute greatly to the comfort of the user and the perceived quality of the client.

The Python reference client and server implementations have already been optimised to this effect. For a typical daily scenario of doing 200 repetitions on your phone and adding 20 cards on your desktop, a sync takes about 4 seconds on a home WLAN with a rather old (2007) Windows Mobile phone. The memory requirement is about 4 MB. This is without any GUI running, and does not take into account the time needed to back up the database before the sync (which in this case turns out to be the dominant factor).

When implementing the client in a different language, here are some tips to help improve performance:

-profile your client to detect where the bottlenecks are
-look for the fastest XML parsing library, preferably a compiled one as opposed to an interpreted one
-buffer *all* the http connections, i.e. read them block by block (e.g. 8kB), as opposed to e.g. log entry by log entry
-to save memory, do all processing in a streaming manner, i.e. don't read in everything at once, but process block by block and throw away each block as soon as it's processed (see the sketch after this list)
-make sure to support all the time-saving features the server supports, like binary download on the very first sync
-at the beginning of 'client.py', there are a few network tweaks which are useful in the context of the Python network stack, but which could also be worth investigating
-reuse the same http connection for the different stages of the sync
-if you don't care about historical statistics, don't download the entire history on the first sync ('interested_in_old_reps' setting)
-run the sync operation in a separate thread to keep the UI responsive
-progress bars are nice, but don't update them after every log entry, as then you would spend more time updating the GUI than doing the actual processing
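
As a sketch of the buffering and streaming advice above (the URL handling and the feed() callback are placeholders, not part of the protocol):

  # Sketch of streaming, buffered processing of a sync response. A real
  # client would feed each block to an incremental XML parser and reuse the
  # same connection across the different stages of the sync.
  import urllib.request

  BLOCK_SIZE = 8192  # read in 8 kB blocks rather than log entry by log entry

  def stream_response(url, feed):
      with urllib.request.urlopen(url) as response:
          while True:
              block = response.read(BLOCK_SIZE)
              if not block:
                  break
              feed(block)  # process and discard the block immediately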