Does operational transformation work on structured documents such as HTML if simply treated as plain text?

The FAQ of the Google Wave Protocol says that [HTML] "does not have desirable properties" and that "HTML makes OT (Operational Transforms) difficult if not impossible" [1]. Why is this so? What problems arise if HTML is treated simply as plain text and OT is then applied?

  1. http://www.waveprotocol.org/faq#TOC-What-s-the-XML-schema-for-waves-Why

I'm assuming here that you understand the basics of OT. The principal problem with doing OT on HTML as plain text is that of merging HTML tags. As a simple example, say we had a document as follows:

Hello world

Alice then decides that "world" should be in bold:

Hello <b>world</b>

This can be represented with a double insert operation in OT, schematically:

Edit A: Keep 6 : Insert "<b>" : Keep 5 : Insert "</b>"

If Bob decided that "world" should be italic before he saw Alice's edit, he would add the operation

Edit B: Keep 6 : Insert "<i>" : Keep 5 : Insert "</i>"

If the server received Bob's edit after Alice's, it would need to transform B against A to become B'.

The Keep statements are unchanged through transformation, but Insert "<i>" transformed over Insert "<b>" can become either Keep 3 : Insert "<i>" or Insert "<i>" : Keep 3. Usually the server will be configured to place the later edit after the first edit.

Edit B': Keep 6 : Keep 3 : Insert "<i>" : Keep 5 : Keep 4 : Insert "</i>"

Here the problem becomes obvious. Applying A then B' to the original string gives the invalid HTML:

Hello <b><i>world</b></i>
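
The walkthrough above can be reproduced with a small sketch. This is a minimal, hypothetical model, not the Wave or ShareJS implementation: operations are lists of ("keep", n) and ("insert", text) components, deletes are left out, and the names apply_op and transform are assumptions made for illustration. Ties between two inserts at the same position are broken in favour of the earlier edit, as described above.

```python
# Hypothetical minimal OT model: insert-only operations made of
# ("keep", n) and ("insert", text) components. Not a real library API.

def apply_op(doc, op):
    """Apply an operation to a string and return the new string."""
    out, pos = [], 0
    for kind, arg in op:
        if kind == "keep":
            out.append(doc[pos:pos + arg])
            pos += arg
        else:  # "insert"
            out.append(arg)
    out.append(doc[pos:])  # anything not covered is implicitly kept
    return "".join(out)


def transform(b, a):
    """Rewrite b so that it can be applied after a (a's inserts go first)."""
    a, b = list(a), list(b)
    result, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if i < len(a) and a[i][0] == "insert":
            # Text inserted by a must be skipped over by the transformed b.
            result.append(("keep", len(a[i][1])))
            i += 1
        elif j < len(b) and b[j][0] == "insert":
            result.append(b[j])
            j += 1
        elif i < len(a) and j < len(b):
            # Both sides keep: consume the shared prefix.
            n = min(a[i][1], b[j][1])
            result.append(("keep", n))
            a[i] = ("keep", a[i][1] - n)
            b[j] = ("keep", b[j][1] - n)
            if a[i][1] == 0:
                i += 1
            if b[j][1] == 0:
                j += 1
        elif j < len(b):
            result.append(b[j])
            j += 1
        else:
            i += 1  # trailing keeps of a need no explicit component
    return result


doc = "Hello world"
A = [("keep", 6), ("insert", "<b>"), ("keep", 5), ("insert", "</b>")]
B = [("keep", 6), ("insert", "<i>"), ("keep", 5), ("insert", "</i>")]

B_prime = transform(B, A)
result = apply_op(apply_op(doc, A), B_prime)
print(result)  # Hello <b><i>world</b></i>
```

Both users made a perfectly valid edit on their own, and the transform did exactly what a plain-text OT is supposed to do; the improperly nested tags come purely from the fact that the transform knows nothing about tag pairing.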

Theoretically this could be solved by varying whether an insert is placed before or after a concurrent one, but this would get hairy for more complicated examples, potentially involving a full document scan for every transformation.

As the other answer noted, this mess can be avoided by using out-of-band annotations plus plain text. Another approach, which I've only seen so far in academic papers, is to treat the XML structure as a tree with OT operations for node addition and deletion, e.g.:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.74

I don't have a complete answer, but I'm interested in seeing more work done on making existing open source operational transformation libraries work with rich text, so I'll contribute what I know.

The important difference between HTML and the Wave schema seems to be the way text formatting is marked up: a hierarchy of nested tags for HTML vs. out-of-band annotations (in the footer of the document) with ranges for Wave XML. Out-of-band annotations are probably a more natural way to mark up text formatting, since they allow overlapping (non-nested) formats. They allow something like this (in pseudo-markup), which would not be valid XML using the nested representation:

(b) This is bold (i) while this range is both bold and italic (/b) and this last bit is just italic (/i)
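
To make the idea concrete, here is a rough sketch of out-of-band annotations: the text is stored as a plain string and formatting as (start, end, tag) ranges kept alongside it. The tuple layout and the helper names segments and render are assumptions for illustration and are not the actual Wave schema. Overlapping ranges are fine in this model; properly nested markup is only produced at render time by splitting the text at every range boundary.

```python
# Hypothetical annotation model: plain text plus (start, end, tag) ranges.
# This is an illustration, not the Wave XML format.

def segments(text, annotations):
    """Split text at annotation boundaries into (chunk, tags) pairs."""
    cuts = sorted({0, len(text)} | {p for s, e, _ in annotations for p in (s, e)})
    out = []
    for lo, hi in zip(cuts, cuts[1:]):
        # A tag applies to a chunk if its range fully covers the chunk.
        tags = sorted(tag for s, e, tag in annotations if s <= lo and hi <= e)
        out.append((text[lo:hi], tags))
    return out


def render(text, annotations):
    """Emit properly nested markup by wrapping each chunk separately."""
    parts = []
    for chunk, tags in segments(text, annotations):
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        parts.append(opening + chunk + closing)
    return "".join(parts)


text = "bold both italic"
annotations = [(0, 9, "b"), (5, 16, "i")]  # the two ranges overlap on "both"

print(segments(text, annotations))
print(render(text, annotations))
# <b>bold </b><b><i>both</i></b><i> italic</i>
```

Because the text itself contains no tags, concurrent edits to it can be transformed as ordinary plain-text operations, and a formatting change only has to adjust range endpoints.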

Related: here is the relevant issue in the ShareJS project. Perhaps they can implement rich text support by adopting part of the Wave XML schema.

There are approaches in OT that support SGML (a superset of XML), but there are no implementations. Therefore, it is not impossible! Though, I agree, OT is not the best approach for handling XML. This is because OT was designed for linear data structures, whereas HTML/XML is much more complex: it has attributes, and it is built like a tree. The fact that it is a tree is solvable, but the attributes, which are realized as an ordered associative array, are not supported, simply because associative arrays are not supported by OT (at the moment).

The approach above actually recommends treating the attributes as a string, e.g. "id='myid' value='mystuff'". But you can easily break the whole syntax of your attribute string when one user deletes all attributes and another one inserts a " character directly after "mystuff". This could result in a div tag that looks like <div ">, which is not valid syntax.
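
The attribute scenario can be sketched in a few lines. This is a simplified, hypothetical position-based model (one delete range and one single-character insert), and transform_insert_against_delete is a name invented for this illustration rather than part of any real OT library.

```python
# Hypothetical sketch: the attribute list is treated as one plain string.

def transform_insert_against_delete(pos, del_start, del_end):
    """Shift an insert position so it applies after a concurrent delete."""
    if pos <= del_start:
        return pos                          # insert lies before the deleted range
    if pos >= del_end:
        return pos - (del_end - del_start)  # insert lies after it
    return del_start                        # insert fell inside the deleted range


attrs = "id='myid' value='mystuff'"

# User 1 deletes every attribute.
del_start, del_end = 0, len(attrs)

# User 2 concurrently inserts a double quote directly after "mystuff".
insert_pos = attrs.index("mystuff") + len("mystuff")

# The server applies the delete first, then the transformed insert.
after_delete = attrs[:del_start] + attrs[del_end:]
new_pos = transform_insert_against_delete(insert_pos, del_start, del_end)
merged = after_delete[:new_pos] + '"' + after_delete[new_pos:]

tag = "<div " + merged + ">"
print(tag)  # <div ">
```

The surviving quote character is a legitimate outcome for plain text, since the insert was never deleted, but as an attribute list it no longer parses.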

Maybe this interests you:

CEFX is a project that aimed to support XML; to my knowledge it is dead. It does use an OT approach, but for some reason it is not possible to edit strings, only XML elements.

Google's Drive SDK supports graph-like data structures. It is, however, proprietary, and nobody knows how it works.

I am developing a framework that supports arbitrary data structures. Currently, text, JSON, XML, and HTML are supported. It takes a different approach; check it out: Yatta!

BTW: what the Wave protocol and Eric Drechsel describe is known as annotations in OT. It is commonly used to support rich text.
