For the new version of our document management module, we wanted to provide online-editing, the ability to create and edit documents via a web-interface. Google has used it for one of its main products: Google Documents.
Advantages of creating a document on the web are:
- You do not need to think about creating backups;
- Members of a team can work together on a document at the same time;
- Usually, the online-editing tool comes with revision control, so you can check who changed what and restore a previous text.
There are also some disadvantages to take into consideration:
- The are currently no online word-processors that are as good as Microsoft Word, or Open Office Writer. An HTML editor is used to create the "document". That means no table of contents, no auto-numbered chapter titels, and no footnotes.
- Online-editing only works if you are online most of your working time.
But for a small team this tool can be very handy to keep track of progress and to take notes for others to work out. A kind of wiki, where a complete wiki would be overkill.
Given that our backup-facilities are well in place, and we are using a powerful HTML editor, Tiny MCE, the real challenge was to allow multiple simultaneous edits. One of the great things about online editing is that it uses a simple basic text format for the source text. From that, PDF or Word files can be created. The basic text format is HTML and since this is such an easy format, it allows you to perform text-manipulation functions, like diff and merge that would not be possible, had the source format been Word, or even ODF. Given these functions, this opens the door for multiple persons working on the same document at the same time.
Diff and merge
To integrate changes that were made by two users simulteneously, you first need to determine the changes that were made, and then to merge them into a new version.
To "diff" two texts, means, to determine the difference between two texts. Take for example the original text:
Bilbo Baggins, the titular protagonist, a respectable, conservative hobbit.
add an extra adjective and remove one too:
Bilbo Baggins, the protagonist, a respectable, conservative, noisy hobbit.
you can easily see what has changed. "titular" was removed, ", noisy" was added. A "diff" is nothing more than a list of additions and deletions that make up the difference between the old text and the new one. There are respectable algorithms that allow the computer to see these differences as well, and they are well documented.
The diff itself can be useful to show the user what has changed between two versions, but what happens if two or more persons work on the same document simultaneously? An other person might edit our example sentence in a different way,
Bilbo Baggins, the titular protagonist, a respectable, conservative, adventurous hobbit.
How can you create a new version of the text that respects both changes? To explain this you need to be familiar with the concept of a "common source". The common source here is the original text, before it was changed by the two persons. If a person A starts to edit a text, the current version number of the text is stored somewhere. The person edits the text. This can take half an hour. In this time someone else (person B) may edit this same text too and save it. This is a simple edit that does not require a merge. This save has created a new version of the text, which at this point is the "latest version". When person A is done and saves the text, the program looks up the stored version number and detects that the version that was edited is not the latest version of the text. It then starts the merge:
- It determines the diff between the "latest version" of the text (submitted by person B) and the common source: -(4)"titular", +(8)", noisy";
- It determines the diff between the "new version" of the text (submitted by person A) and the common source: +(8)", adventurous";
- It combines these differences (and resolves any conflicts this may entail): -(4)"titular", +(8)", noisy", +(8)", adventurous";
- It applies these combined differences to the common source and this forms a new version of the text.
This algorithm is called a three-way merge. You can see that the algorithm still works if some other person has edited the text after person B and before the commit of person A. And it also works if three or more people edit the document.
Make or buy?
The algorithm is easy to understand and has been implemented many times. Why not just pick the cheapest one and use it? Because there are many things that still need to be considered. For starters, the common ones:
- Is it available for my platform (Windows, Linux, OS X);
- The type of license that is used;
- Is it bugfree or are bugs quickly resolved by the development team / person;
- Is it open source / can I make changes, does the license allow this and allow this to be used commercially?
- Is it fast enough for my application?
But in this case there are also some specific questions to be asked
- Does it only work on a per-line basis, or also per-word?
- Does it have any special handling of HTML texts (where "tags" should be treated as words, and that makes sure the result of the merge is still (X)HTML compliant);
- How strong are its automatic conflict-resolution heuristics?
These risks need to be balanced against the question: how much time does it take to make it ourselves? And the follow up: can we make it fast enough, just using PHP?
Another thing, not mentioned so far, is that we wanted the tool to be so good that it completely dispensed with the need to resolve textual conflicts manually. The reason was that the documents should also be editable via a WebDAV interface and thus via any off-line desktop application. It this environment the user would not have the ability to resolve conflicts (at least not in a user-friendly way). By creating the tool in-house we would always have the possibility to improve the heuristics needed.
Performance. By using PHP as implementation language we risked that the merge would become too slow. Therefore we used all the optimizations we could find. The algorithm creates, and iterates over, an array the size of n2, where n is the number of lines, or the number of words in a single line. Therefore the main thing is to keep the number of tokens low. We used "identity integers" to represent lines of text, and only if two lines conflicted, were the conflicting lines split into words and turned into identity integers. It is also a good idea to determine the first and last lines of the text where the changes are made and use only the lines in between. This easily halves the search space.
Treat HTML different from plain text. If an HTML string was tokenized the same as plain text, a simple tag like "<img src='https://www.image.com/html/galleries/shipboard38.html'>" would consist of at least 20 tokens. By treating tags as words, this tag compares to a single token. This is important for the performance of the Longest Common Subsequence function.
Tidy up. After the merge was complete, it would sometimes be unavoidable to create XHTML. This was then cleaned up using Tidy. For example:
- Source: An interesting site.
- Change 1: <b>An interesting site.</b>
- Change 2: <a href=''>An interesting site.</a>
Result: <b><a href=''>An interesting site.</b></a>
Cleaned up: <b><a href=''>An interesting site.</a></b>
Whitespace tokens. Determine if you want to regard whitespace characters as first class tokens. Suppose someone adds an extra tab or space character. Do you want to save this change? If so, whitespace is not just a separation character, it is a token by itself.
How do you determine if a line has "changed"? The Longest Common Subsequence algorithm only determines additions and removals. However, you also need to know if a line changed, so you can merge the changes. You cannot simply say that a line has changed if it is different than before. The line may have been removed and another line been moved to its location. As a heuristic we took this fact and concluded that if a line was different and the new line was not present in the source version, the line must have changed.
Conflict resolution. When the lists of differences are combined, you can get different types of conflicts, that need to be resolved in different ways:
- Line L is added by user A. Line L is added by user B. The same line is added by two persons at the same time. In the resulting version the line should be added only once, of course.
- Line L is removed. Line L is removed. This is not really a problem, since the line can only be removed once.
- Line L is removed. Line L is changed. Since a "change" consists of a removal and an addition, the two versions have a removal in common. Therefor the line is changed and not removed.
- Line L is changed. Line L is changed. The two new versions need to be merged. The line is split into words and the same merging process for the lines is repeated for words.
Recursion limits. The wiki-page algorithm used recursion to read out the differences. At the time we ran the integration tests we found out that our integration tool complained that the recursion level exceeded 100. At that point we discovered that it is a real PHP limit and had to create a custom "call-stack" for this function. PHP can be funny that way.
When the merger was ready it needed to be put to the test. An edge case was created in which a typical wiki page was edited by two concurrent users both at the beginning and at the end (as a unit test). This case would be very rare in reality, but if it occurred, it should not take up all system resources or take a minute to complete. It was a great relief to find that this merge only took three seconds, which was acceptable for such a rare occasion.
It is worth pointing out that Wikipedia was our only source of information. It is quite normal these days to start researching a subject by looking up a Wikipedia page, but this time the information available there was so complete and usable that there really was no need to look further and it also gave us great confidence to embark on this adventure. This is really impressive.