
Interpreting XML produced by Microsoft Office Word and Excel documents

I want to build a free system to extract the text, text formatting (e.g. bold) and images from documents such as Excel and Word files.

In my research I have found that the structure of Excel (xlsx) and Word (docx) documents is defined in XML, which becomes visible once you extract the document with a compression utility like 7zip.
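Because a docx or xlsx file is an ordinary zip archive, the same extraction can be done from code instead of 7zip. The following is a minimal Python sketch using only the standard library. The helper names `list_parts` and `build_demo_package` are hypothetical, and the demo archive is a stand-in built in memory, not a real Word file.

```python
import io
import zipfile


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of the files stored inside a zip-based Office package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def build_demo_package() -> bytes:
    """Build a tiny stand-in archive so the sketch runs without a real docx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<doc/>")
        package.writestr("word/media/image1.png", b"not a real image")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_parts(build_demo_package()):
        print(name)
```

For a real file, reading its bytes with `open(path, "rb").read()` and passing them to `list_parts` should show entries such as `word/document.xml`.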

I am skilled in VBA; however, I have not been able to find an object model (listing ALL objects and methods that can be applied or manipulated) for any of:

  1. Excel VBA
  2. Word VBA
  3. Word XML
  4. Excel XML

I already know many Excel VBA objects, but only through trial and error and experimentation, not through reading an object model where the methods and objects are defined!

The problem

  • I don't know how to interpret the XML because I don't have an object model showing me what each element means (for example, which one means bold).

I am trying to develop a tool which looks through the XML to find:

  1. The location of any images in the document, both the relative directory (in the Directory / Word / Media folder) and the actual file path, e.g. C:\\documents\\josh\\img1.png
  2. The position of any text in the document (I'm thinking in terms of lines, reading the document from top to bottom, as well as alignment such as centre), so I can reproduce the text in the right order.
  3. The formatting applied to the text (bold, font, size, etc.).

Please help me find an object model or some way to interpret or parse this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14"><w:body><w:p w:rsidR="001920B6" w:rsidRDefault="001920B6" w:rsidP="001920B6"><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/><w:r><w:rPr><w:noProof/></w:rPr><w:drawing><wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" wp14:anchorId="4B104522" wp14:editId="4A3907E9"><wp:simplePos x="0" y="0"/><wp:positionH relativeFrom="column"><wp:posOffset>0</wp:posOffset></wp:positionH><wp:positionV relativeFrom="paragraph"><wp:posOffset>1209675</wp:posOffset></wp:positionV><wp:extent cx="5943600" cy="3343275"/><wp:effectExtent l="0" t="0" r="0" b="9525"/><wp:wrapTight wrapText="bothSides"><wp:wrapPolygon edited="0"><wp:start x="0" y="0"/><wp:lineTo x="0" y="21538"/><wp:lineTo x="21531" y="21538"/><wp:lineTo x="21531" y="0"/><wp:lineTo x="0" y="0"/></wp:wrapPolygon></wp:wrapTight><wp:docPr id="1" 
name="Picture 1"/><wp:cNvGraphicFramePr><a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:nvPicPr><pic:cNvPr id="0" name="windows.png"/><pic:cNvPicPr/></pic:nvPicPr><pic:blipFill><a:blip r:embed="rId7" cstate="print"><a:extLst><a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}"><a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/></a:ext></a:extLst></a:blip><a:stretch><a:fillRect/></a:stretch></pic:blipFill><pic:spPr><a:xfrm><a:off x="0" y="0"/><a:ext cx="5943600" cy="3343275"/></a:xfrm><a:prstGeom prst="rect"><a:avLst/></a:prstGeom></pic:spPr></pic:pic></a:graphicData></a:graphic><wp14:sizeRelH relativeFrom="page"><wp14:pctWidth>0</wp14:pctWidth></wp14:sizeRelH><wp14:sizeRelV relativeFrom="page"><wp14:pctHeight>0</wp14:pctHeight></wp14:sizeRelV></wp:anchor></w:drawing></w:r><w:r w:rsidR="00327DB9"><w:rPr><w:noProof/></w:rPr><w:t>Plain text</w:t></w:r></w:p><w:p w:rsidR="00327DB9" w:rsidRDefault="00327DB9" w:rsidP="001920B6"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r><w:r w:rsidR="0009704D" w:rsidRPr="00327DB9"><w:rPr><w:b/></w:rPr><w:t xml:space="preserve"> text</w:t></w:r></w:p><w:p w:rsidR="00327DB9" w:rsidRPr="00327DB9" w:rsidRDefault="00327DB9" w:rsidP="00327DB9"><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>heading</w:t></w:r></w:p><w:p w:rsidR="00327DB9" w:rsidRPr="001920B6" w:rsidRDefault="00327DB9"/><w:sectPr w:rsidR="00327DB9" w:rsidRPr="001920B6"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/><w:docGrid 
w:linePitch="360"/></w:sectPr></w:body></w:document>
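As a rough illustration of how the markup above can be read, here is a minimal Python sketch that walks the paragraphs (`w:p`) and runs (`w:r`) in document order and reports each run's text (`w:t`), whether it carries `w:b` (bold), and the paragraph style from `w:pStyle`. The `SAMPLE` string is a trimmed copy of the text portion of the XML above, and the function names are hypothetical. It only looks at formatting applied directly to a run; bold inherited from a style such as `Heading1` is not resolved here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Trimmed copy of the three text paragraphs from the document above.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:t>Plain text</w:t></w:r></w:p>
<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve"> text</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>heading</w:t></w:r></w:p>
</w:body></w:document>"""


def is_bold(run: ET.Element) -> bool:
    """True when the run has a direct w:b element that is not switched off."""
    bold = run.find("w:rPr/w:b", NS)
    if bold is None:
        return False
    return bold.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


def read_paragraphs(xml_text: str):
    """Return (style, [(text, bold), ...]) for each paragraph, top to bottom."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for paragraph in root.iterfind("w:body/w:p", NS):
        style = paragraph.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        runs = []
        for run in paragraph.iterfind("w:r", NS):
            text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
            runs.append((text, is_bold(run)))
        paragraphs.append((style_name, runs))
    return paragraphs


if __name__ == "__main__":
    for style_name, runs in read_paragraphs(SAMPLE):
        print(style_name, runs)
```

Since paragraphs appear in `w:body` in reading order, iterating them in sequence gives the top-to-bottom order of the text.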

Questions about the XML

  1. How do I determine the position of the image (which is at the bottom) relative to the text (which is above it)?
  2. How many images are there? Is there just one because the picture has an ID or INDEX of 0?
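On the image count: the `id="0"` on `pic:cNvPr` is an identifier, not a count. One way to count images is to count the `a:blip` elements, each of which points at a picture through `r:embed` (here `rId7`). That id is resolved through the package's relationships part (normally `word/_rels/document.xml.rels`), which was not included in the question. The `RELS_XML` below, including the target `media/image1.png`, is therefore a made-up example, and `image_targets` is a hypothetical helper. The sketch also assumes targets are relative to the `word/` folder, which is the usual case but not the only one.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"

# Reduced stand-in for document.xml: only the blip reference is kept.
DOCUMENT_XML = f'<root xmlns:a="{A}" xmlns:r="{R}"><a:blip r:embed="rId7"/></root>'

# Hypothetical relationships part; the real one was not shown in the question.
RELS_XML = (
    f'<Relationships xmlns="{PKG_REL}">'
    f'<Relationship Id="rId7" Type="{R}/image" Target="media/image1.png"/>'
    "</Relationships>"
)


def image_targets(document_xml: str, rels_xml: str) -> list[str]:
    """Map every a:blip in the document to the package path of its image."""
    relationships = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG_REL}}}Relationship")
    }
    targets = []
    for blip in ET.fromstring(document_xml).iter(f"{{{A}}}blip"):
        rel_id = blip.get(f"{{{R}}}embed")
        if rel_id in relationships:
            # Assumption: the target is relative to the word/ folder.
            targets.append("word/" + relationships[rel_id])
    return targets


if __name__ == "__main__":
    found = image_targets(DOCUMENT_XML, RELS_XML)
    print(len(found), found)
```

On position: in the sample the picture sits in a `wp:anchor` inside the first paragraph, with `wp:positionV relativeFrom="paragraph"` and an offset in EMUs (914400 per inch), so it is a floating image placed relative to that paragraph rather than an inline one.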

Take a look at Office Open XML, which is the XML structure used by MS Office documents such as docx and xlsx: http://openxmldeveloper.org/ . There is a rather good ebook which explains the basics: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2007/08/13/1970.aspx

But beware: parsing or interpreting Office Open XML is an extremely large task, especially in VBA, which is ill-suited to this job. There are numerous libraries in C# / VB.net which can read Office Open XML documents; these would be a better starting point.

