
What are the effects of an "XML Roundtrip" on Word 2003 documents?

Saving a Word 2003 document to XML and then back results in a reduced file size, and probably more that I don't know about. A diff on the WordML of the new document against the old shows differences only in the revision save IDs. So, what is getting lost in the roundtrip?

If nothing is actually getting lost, then how would one explain the few thousand bytes off the size of the file?

The following is just a guess.

A .doc file is actually an OLE structured storage compound file. The latter is a way to pack multiple streams into a single document in a well-defined way, and the structure is actually pretty close to a filesystem-in-a-file: for example, it has "sectors" and a sector allocation table. Such an approach makes it possible to edit a document file in place without rewriting it completely.

However, this storage approach results in some redundancy, such as unused sectors. When you roundtrip the file, you effectively recreate it from scratch, and thus any such redundant storage artefacts are eliminated.
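One way to check the unused-sector guess is to count the free entries in the sector allocation table of the file before and after the roundtrip. The following is only a minimal sketch: it assumes the header layout from the published compound file format (signature at offset 0, sector shift at offset 30, FAT sector count at offset 44, first 109 FAT sector locations at offset 76), and it ignores files large enough to need extra DIFAT sectors. The function name `count_free_sectors` is made up for illustration.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
FREESECT = 0xFFFFFFFF  # FAT entry value marking an unallocated sector


def count_free_sectors(data: bytes):
    """Return (free_sectors, sector_size) for a compound file image.

    Sketch only: reads just the FAT sectors listed in the header's
    109-entry DIFAT array, so very large files are not fully covered.
    """
    if len(data) < 512 or data[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE compound file")

    (sector_shift,) = struct.unpack_from("<H", data, 30)
    sector_size = 1 << sector_shift
    (num_fat_sectors,) = struct.unpack_from("<I", data, 44)
    difat = struct.unpack_from("<109I", data, 76)

    # The header occupies the first sector-sized block of the file.
    total_sectors = len(data) // sector_size - 1
    entries_per_sector = sector_size // 4

    free = 0
    for i, fat_sector in enumerate(difat[:num_fat_sectors]):
        if fat_sector == FREESECT:
            continue
        # Sector n starts at byte offset (n + 1) * sector_size.
        offset = (fat_sector + 1) * sector_size
        entries = struct.unpack_from(f"<{entries_per_sector}I", data, offset)
        for j, entry in enumerate(entries):
            sector_index = i * entries_per_sector + j
            # Skip FAT padding that describes sectors past the end of the file.
            if sector_index < total_sectors and entry == FREESECT:
                free += 1
    return free, sector_size
```

Calling it on the bytes of the original and the round-tripped .doc (for example `count_free_sectors(open("before.doc", "rb").read())`, with a hypothetical file name) and multiplying the free count by the sector size gives a rough idea of how many bytes sit in unallocated sectors. It says nothing about slack inside allocated sectors.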

As far as I know, Word stores some information in DOC files in addition to text and formatting, for example user information, some of the document history, etc. This information accumulates when using "File > Save". I suppose that saving as XML and re-saving as DOC strips that information.

If I recall correctly, a simple "Save As" already reduces the file size, and I think there used to be a menu item that let you save a version of the DOC file that was significantly smaller than the "File > Save" version.

If you look at a Word document (.doc) in a hex editor, you will see that there are many, many blocks of redundant zeroes. Great format, doc!

Anyway, saving to XML and then back to .doc might get rid of some of those thousands of zero bytes.
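To put a number on the zero blocks instead of eyeballing them in a hex editor, something like the following could be run on both files. It is a rough sketch: the function name `zero_run_stats` and the 16-byte threshold are arbitrary choices for illustration, not anything defined by the .doc format.

```python
import re


def zero_run_stats(data: bytes, min_run: int = 16):
    """Return (number_of_runs, total_bytes) for runs of 0x00 bytes
    that are at least min_run bytes long."""
    pattern = re.compile(rb"\x00{%d,}" % min_run)
    runs = [m.end() - m.start() for m in pattern.finditer(data)]
    return len(runs), sum(runs)
```

If the total drops by roughly the same few thousand bytes that the file shrank by, the zero padding is a plausible explanation; if it barely changes, the savings come from somewhere else.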

If you're really curious, just open both files in a hex editor and run a difference algorithm; you can try Hex Workshop or Hex Editor Neo.

My experiments with a few large Word 2003 documents show that saving as XML, then saving that as .doc, indeed results in a slightly, though not significantly, smaller file. As you point out, the rsidR attributes are different, but that does not account for the reduction in size, since the new rsidRs are typically the same size.

As Danra points out, .doc files have runs of identical bytes. But the smaller file saved as .doc also has such runs, so I believe this is an artifact of the .doc binary format and not information-carrying data. I eyeballed a few of the round-tripped .doc files and could see no difference in appearance at all, supporting the idea that the differences are not information-carrying.

Examining the XML files created after round-tripping shows that the main difference is that several rPr (run properties) elements with no content are removed after converting to XML. It seems that saving as XML removes unused character styles and properties.
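The empty rPr elements can be counted in each XML file rather than spotted in a diff. This is a small sketch that assumes the files use the Word 2003 WordML namespace shown in `W_NS`; the function name `count_empty_rpr` is invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for Word 2003 WordML (the "w:" prefix).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def count_empty_rpr(xml_data):
    """Return (empty_rpr_count, total_rpr_count) for a WordML document.

    An rPr element counts as empty when it has no child elements,
    no attributes, and no non-whitespace text.
    """
    root = ET.fromstring(xml_data)
    all_rpr = list(root.iter(f"{{{W_NS}}}rPr"))
    empty = [
        e for e in all_rpr
        if len(e) == 0 and not e.attrib and not (e.text or "").strip()
    ]
    return len(empty), len(all_rpr)
```

Running it on the XML saved from the original document and on the XML saved from the round-tripped one should show the empty count dropping if the observation above holds for your files.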
