
Convert a Word 2007 file into XML

I have a Word 2007 .docx document. I turned it into a zip file by adding the extension ".zip" to the end of its name. When I extracted the zip file, I got a few folders with XML files in them. I want to combine all the XML files in those folders into a single XML document so that I can write an XSL stylesheet for it. I do not want to open the ".docx" file and use "Save as XML". Is there a way to do that? Or can I at least get the WordML file of that document? If so, how? Thank you in advance.

Use any tool (or toolset) of your choice that supports unzipping, directory tree walking, and line-based text file processing. First unzip your Word file, preserving the directory structure of the archive. Next, launch your directory walker on the directory you unpacked into and process all .xml and .rels files: delete the first line (containing the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8" standalone="yes"?>) from each of them and append each in turn to your global output XML file, whose first line should itself be an XML declaration. Because an XML document can have only one root element, wrap the concatenated content in a single enclosing element so that the result is well-formed. Make sure that your tools respect the character encoding of the XML files (which should be UTF-8).
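The steps above can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation: the function name combine_docx_parts, the root element name "package", and the <part name="..."> wrapper around each file are all assumptions made for illustration and are not part of any Office format. It also assumes that every part is UTF-8 encoded, and it reads the archive directly instead of unpacking it to disk first.

```python
import re
import zipfile
from xml.sax.saxutils import quoteattr

# Matches a leading XML declaration such as <?xml version="1.0" ...?>
XML_DECL = re.compile(r"^\s*<\?xml\s[^>]*\?>\s*")


def combine_docx_parts(docx_path, out_path, root_tag="package"):
    """Concatenate every .xml and .rels part of a .docx into one XML file.

    root_tag and the <part> wrapper are hypothetical names chosen for this
    sketch; pick whatever suits your stylesheet.
    """
    with zipfile.ZipFile(docx_path) as archive, \
            open(out_path, "w", encoding="utf-8") as out:
        # The combined file gets exactly one declaration and one root element.
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write(f"<{root_tag}>\n")
        for name in sorted(archive.namelist()):
            if not name.lower().endswith((".xml", ".rels")):
                continue  # skip images, VBA binaries, and other non-XML parts
            # Assumption: parts are UTF-8; "utf-8-sig" also drops a BOM if present.
            text = archive.read(name).decode("utf-8-sig")
            # Remove the per-file XML declaration before appending.
            text = XML_DECL.sub("", text, count=1)
            # Record which archive entry this content came from.
            out.write(f"<part name={quoteattr(name)}>\n")
            out.write(text.rstrip())
            out.write("\n</part>\n")
        out.write(f"</{root_tag}>\n")
```

Calling combine_docx_parts("document.docx", "combined.xml") (both file names are placeholders) should produce a single well-formed document in which each original file sits inside its own <part> element, so an XSL stylesheet can select parts by their name attribute.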

The Perl packages File::Find and Archive::Zip come in handy for this task, but you can also get the job done with standard CLI tools (zip/unzip, find, cat, sed).

You may have to add a synthetic distinguishing attribute to the top-level Relationships elements of the .rels files to avoid Id clashes. Most relationship entries should be uniquely identifiable by their Type attribute, but the MS specs appear a bit vague on whether Office itself guarantees unique Ids across all relationship items of the same kind (or I haven't read the specs thoroughly enough ...). Note that the names of relevant non-XML files (graphics, VBA code) show up in [Content_Types].xml and the relationships files.
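One way to add such a distinguishing attribute is sketched below in Python. The function name tag_relationships and the attribute name sourcePart are hypothetical and chosen only for illustration; the idea is simply to stamp each top-level Relationships element with the path of the .rels file it came from, so that identical Id values from different files can still be told apart after merging.

```python
import xml.etree.ElementTree as ET

# Namespace used by .rels files in Open Packaging Conventions packages.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Serialize this namespace as the default one instead of an ns0: prefix.
ET.register_namespace("", RELS_NS)


def tag_relationships(rels_bytes, part_name, attr="sourcePart"):
    """Return the .rels content with a synthetic attribute on its root.

    attr is a made-up attribute name for this sketch, not something
    defined by the Office specifications.
    """
    root = ET.fromstring(rels_bytes)
    if root.tag != f"{{{RELS_NS}}}Relationships":
        raise ValueError(f"{part_name} is not a relationships part")
    root.set(attr, part_name)
    # encoding="unicode" returns a str without an XML declaration,
    # so the result can be appended to a combined file directly.
    return ET.tostring(root, encoding="unicode")
```

A stylesheet could then key relationships on the pair (sourcePart, Id) rather than on Id alone.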

Hope that (still) helps. Regards, Carsten
