Convert a Word 2007 file to XML

I have a Word 2007 .docx document. I turned it into a zip file by appending the extension ".zip" to its name. When I extracted the zip file, I got a few folders containing XML files. I want to combine all of the XML files in those folders into a single XML document, so that I can write an XSL stylesheet for it. I do not want to open the .docx file and use "Save as XML". Is there a way to do that? Or can I at least get the WordML file for that document? If so, how? Thank you in advance.

Use the tool (or toolset) of your choice that supports unzipping, directory tree walking, and line-based text file processing. First unzip your Word file, preserving the directory structure of the archive. Next, launch your directory walker on the directory you unpacked into and process all .xml and .rels files: delete the first line (containing the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8" standalone="yes"?>) from each of them and append each file in turn to your global output XML file, the first line of which should itself contain an XML declaration. For the result to be well-formed XML, the appended content also has to sit inside a single enclosing root element. Make sure that your tools respect the character encoding of the XML files (which should be UTF-8).
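The steps above can be sketched in Python roughly as follows. This is only a minimal sketch, not the only way to do it: it reads the parts straight from the archive with the standard zipfile module instead of unpacking to disk first, and it strips the XML declaration with a regular expression rather than by deleting the first line. The function name combine_docx_parts and the wrapper elements package and part are hypothetical names chosen for illustration; they are not part of any Office format.

```python
import re
import zipfile
from xml.sax.saxutils import quoteattr

# Matches a leading XML declaration such as <?xml version="1.0" ...?>
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def combine_docx_parts(docx_path, root_tag="package"):
    """Concatenate every .xml and .rels part of a .docx into one XML string.

    Assumption: each part is wrapped in a made-up <part name="..."> element
    and everything sits inside a made-up root element, so the result is
    well-formed and each part can still be told apart.
    """
    pieces = ['<?xml version="1.0" encoding="UTF-8"?>', f"<{root_tag}>"]
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith((".xml", ".rels")):
                continue  # skip images, VBA binaries, and other non-XML parts
            # utf-8-sig drops a byte order mark if one is present
            text = archive.read(name).decode("utf-8-sig")
            text = XML_DECL.sub("", text, count=1)
            pieces.append(f"<part name={quoteattr(name)}>")
            pieces.append(text)
            pieces.append("</part>")
    pieces.append(f"</{root_tag}>")
    return "\n".join(pieces)
```

A call such as combine_docx_parts("report.docx") (the file name is just an example) returns the combined document as a string; write it out with encoding="utf-8" so that it matches the declaration on the first line.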

The Perl packages File::Find and Archive::Zip come in handy for this task, but you can also get the job done with standard CLI tools (zip/unzip, find, cat, sed).

You may have to add some synthetic distinguishing attribute to the top-level Relationships elements of the .rels files to avoid Id clashes. Given the Type attribute, the applicability of most relationship entries should be unique, but the MS specs appear a bit vague on whether Office itself guarantees unique Ids across all relationship items of the same kind (or I haven't read the specs thoroughly enough ...). Note that the names of relevant non-XML files (graphics, VBA code) show up in [Content_Types].xml and in the relationships files.
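One way to add such a distinguishing attribute is sketched below with Python's standard xml.etree.ElementTree. It is only an illustration under stated assumptions: the attribute name sourcePart and the function name tag_relationships are made up here, and the code simply records which .rels file a Relationships element came from, so that equal Ids (such as rId1) from different files can still be told apart after merging.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Relationships element in .rels parts
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Serialize that namespace as the default one (no ns0: prefixes)
ET.register_namespace("", RELS_NS)


def tag_relationships(rels_bytes, part_name, attr="sourcePart"):
    """Return the .rels XML with a synthetic attribute on its root element.

    `attr` is a hypothetical attribute name, not something Office defines.
    The returned string carries no XML declaration, so it can be appended
    to a combined output file as is.
    """
    root = ET.fromstring(rels_bytes)
    if root.tag != f"{{{RELS_NS}}}Relationships":
        raise ValueError(f"{part_name} is not a relationships part")
    root.set(attr, part_name)
    return ET.tostring(root, encoding="unicode")
```

For example, tag_relationships(data, "word/_rels/document.xml.rels") marks every relationship in that file as belonging to word/document.xml's relationship part, which an XSL stylesheet can then match on together with the Id.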

Hope that (still) helps. Regards, Carsten
