How do I load and query Word/Excel documents in MarkLogic Server?

Question

I want to load MS Office Word/Excel documents into MarkLogic and query them with XQuery, as I do with XML documents. But when I load .doc files into MarkLogic, they are stored in binary format and show junk characters when viewed in Query Console. I tried loading with the following command:

xdmp:document-load("E:\doc\sample.doc", 
    <options xmlns="xdmp:document-load"
             xmlns:http="xdmp:http">
      <format>xml</format>
    </options>)

But it fails with an error saying the document is not UTF-8 encoded. I want to know whether .doc and .xls files can be loaded as-is into MarkLogic, or whether they have to be converted to XML or a UTF-8 encoded format before loading. If they must be converted, what is the conversion process? If not, how can we query them with XQuery? I also want to know whether an MS Office 2007/2010 installation is necessary for the conversion process, since both Office 2007 and 2010 support the OOXML format.

Please give me proper guidance about this.

Answer 1

Grtjn's reply is correct if you're dealing with Office documents in a format prior to 2007/2010. For 2007/2010 documents, enable the "Office OpenXML Extract" pipeline in CPF and reload the documents. This pipeline does not require the additional conversion option. It will load the source XML as-is.

Office 2007/2010 docs are just .zip files containing interrelated XML parts. This pipeline will unzip any .docx, .xlsx, or .pptx document and save its component parts in a directory named after the source document. The directory is saved as a sibling of the source document and is linked to it, so if, for example, you delete the source .docx, the directory containing the extracted parts is deleted as well.
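Once the parts have been extracted, they can be queried like any other XML. The following is only a sketch: it assumes the source was loaded at the hypothetical URI /docs/sample.docx and that the pipeline wrote the parts under /docs/sample_docx_parts/, so check the actual directory name in your database before relying on it.

```xquery
xquery version "1.0-ml";

(: WordprocessingML namespace used by the main part of a .docx package :)
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Assumption: hypothetical parts directory; adjust to what the pipeline created :)
let $parts-dir := "/docs/sample_docx_parts/"

(: word/document.xml holds the main body of a Word document :)
let $main := fn:doc(fn:concat($parts-dir, "word/document.xml"))

(: Return one string per paragraph, built by joining its text runs :)
for $p in $main//w:p
return fn:string-join($p//w:t/fn:string(.), "")
```

If the URI guess is wrong, `xdmp:directory("/docs/", "infinity")` can be used to list what was actually created under the directory.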

Make sure automatic directory creation is enabled for the database. (This is the default setting for MarkLogic 5.0 and prior versions.)
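The setting can be changed on the database configuration page of the Admin interface. If you prefer to script it, a minimal sketch with the Admin API might look like the following; the database name "Documents" is an assumption, so substitute your own.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: the target database is named "Documents" :)
let $config := admin:get-configuration()
let $db-id  := xdmp:database("Documents")
return
  (: Switch directory creation to "automatic" and persist the change :)
  admin:save-configuration(
    admin:database-set-directory-creation($config, $db-id, "automatic"))
```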

Answer 2

They are binary, so they should be inserted as binary. But you want them to be converted, and MarkLogic can do that for you automatically. To do so, follow these steps:

Open the Admin interface
Go to the appropriate database
Open the Content Processing page
Open the Install tab, set the 'enable conversion' toggle to 'true', and hit install
Check the scope of the domain to make sure you are inserting within that scope, e.g. insert documents at a database URI that starts with the scope URI (this most likely means you need to add a uri option to xdmp:document-load that starts with /)
Check the pipelines to see which types of content are being converted automatically, and to which format (most typically XHTML or DocBook)
Rerun the xdmp:document-load
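For the last step, the reload might look roughly like this. It is a sketch that assumes the domain scope covers the hypothetical directory /docs/; the file path is the one from the question.

```xquery
xquery version "1.0-ml";

(: Load the Word file as a binary document so CPF can pick it up :)
xdmp:document-load("E:\doc\sample.doc",
  <options xmlns="xdmp:document-load">
    <!-- Assumption: /docs/ lies inside the CPF domain scope -->
    <uri>/docs/sample.doc</uri>
    <!-- Store the file as binary instead of parsing it as XML -->
    <format>binary</format>
  </options>)
```

Loading with `<format>binary</format>` avoids the "not UTF-8 encoded" error, because MarkLogic no longer tries to parse the file as XML.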

The Content Processing Framework will create additional files containing the conversion results. These usually consist of an XHTML file with the text, separate image files if there are any, CSS with layout properties, etc.

This does require a license with the conversion option.

HTH!
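The converted XHTML can then be searched like any other XML content. The sketch below does not depend on the exact URIs the conversion produces; the /docs/ directory and the search word "invoice" are placeholders chosen for illustration.

```xquery
xquery version "1.0-ml";

declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Find converted XHTML documents under /docs/ that contain a given word :)
for $hit in cts:search(
  /xhtml:html,
  cts:and-query((
    cts:directory-query("/docs/", "infinity"),  (: assumption: load directory :)
    cts:word-query("invoice")                   (: placeholder search term :)
  )))
return xdmp:node-uri($hit)
```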

Answer 3

OOXML

.doc and .xls are binary files which cannot be processed by XQuery processors directly.

Use OOXML like you mentioned. Save the files as .docx or .xlsx, which are zipped XML files (with some additional resources, such as images, in the zip folders). Maybe the MarkLogic zip module can help you extract the files.

Using MS Office 2003

This can also be done using MS Office 2003 with the File Format Compatibility Pack installed. I'm sorry I cannot help you with batch conversion, but I am sure there is some way to do this using VBA. Ask another question if needed.
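As a sketch of the zip approach, MarkLogic's built-in `xdmp:zip-manifest` and `xdmp:zip-get` functions can read parts straight out of a package stored as a binary document. The URI /docs/sample.xlsx is hypothetical, and the example assumes the workbook contains an xl/sharedStrings.xml part, which is where Excel normally keeps cell text.

```xquery
xquery version "1.0-ml";

declare namespace zip = "xdmp:zip";
declare namespace s = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

(: Assumption: the workbook was loaded as binary at this hypothetical URI :)
let $xlsx := fn:doc("/docs/sample.xlsx")/binary()

(: Names of all parts inside the package :)
let $part-names := xdmp:zip-manifest($xlsx)/zip:part/fn:string(.)

(: Shared strings part, returned as XML :)
let $shared := xdmp:zip-get($xlsx, "xl/sharedStrings.xml")

return (
  $part-names,
  (: One string per shared-string item, joining rich-text runs if present :)
  for $si in $shared//s:si
  return fn:string-join($si//s:t/fn:string(.), "")
)
```

The same pattern should work for a .docx by requesting word/document.xml and using the WordprocessingML namespace instead.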
