How to load & query Word/Excel documents in MarkLogic Server?
I want to load MS Office Word/Excel documents into MarkLogic and query them with XQuery, as is done with XML documents. But when I load .doc files into MarkLogic, they are stored in binary format and show junk characters when viewed in Query Console. I tried loading with the following command:
xdmp:document-load("E:\doc\sample.doc",
<options xmlns="xdmp:document-load"
xmlns:http="xdmp:http">
<format>xml</format>
</options>)
But it shows an error saying the document is not UTF-8 encoded. I want to know whether .doc and .xls files can be loaded as-is into MarkLogic, or whether they have to be converted to XML or a UTF-8 encoded format before loading. If they must be converted, what is the conversion process? If not, how can we query them with XQuery? I also want to know whether an MS Office 2007/2010 installation is necessary for the conversion process, since both Office 2007 and 2010 support the OOXML format.
Please give me proper guidance about this.
Grtjn's reply is correct if you're dealing with Office documents in a format prior to 2007/2010. For 2007/2010 documents, enable the "Office OpenXML Extract" pipeline in CPF and reload the documents. This pipeline does not require the additional conversion option. It will load the source XML as-is.
Office 2007/2010 docs are just .zip files containing interrelated XML parts. This pipeline will unzip any .docx, .xlsx, or .pptx document and save its component parts in a directory named after the source document. The directory will be saved as a sibling to the source document and will be linked to the source; for example, if you delete the source .docx, the directory containing the extracted parts will also be deleted.
Make sure automatic directory creation is set to true for the database. (This is the default setting for MarkLogic 5.0 and prior versions.)
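Once the pipeline has extracted the parts, they are ordinary XML documents and can be queried like any others. A minimal sketch, assuming the extracted parts of a Word document ended up under a directory URI such as /docs/sample_docx_parts/ (the URI and the search word are placeholders for illustration; check which directory the pipeline actually created in your database):

```xquery
xquery version "1.0-ml";

(: Standard WordprocessingML namespace used by the parts of a .docx :)
declare namespace w =
  "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: ASSUMPTION: hypothetical directory holding the extracted parts :)
let $dir := "/docs/sample_docx_parts/"

(: Return the text of every paragraph that contains the word "invoice" :)
for $p in xdmp:directory($dir, "infinity")//w:p
where cts:contains($p, cts:word-query("invoice"))
return fn:string-join($p//w:t/fn:string(), "")
```

For spreadsheets the same pattern should apply, with the SpreadsheetML namespace and element names in place of w:p and w:t.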
They are binary, so they should be inserted as binary. But you want them to be converted. MarkLogic can do that for you automatically. To do so, do the following:
The Content Processing Framework will create additional files containing the conversion results. These usually consist of an XHTML file with the text, separate image files if there are any, CSS with layout properties, etc.
This does require a license with the conversion option.
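For the "insert as binary" part, a minimal sketch based on the command from the question, with the format switched from xml to binary. The target URI /docs/sample.doc is an assumption for illustration; whether conversion then runs depends on the CPF domain and pipelines configured for that URI.

```xquery
xquery version "1.0-ml";

(: Load the Word file as a binary document instead of XML,
   which avoids the "not UTF-8 encoded" error.
   ASSUMPTION: the target URI below is a placeholder. :)
xdmp:document-load("E:\doc\sample.doc",
  <options xmlns="xdmp:document-load">
    <uri>/docs/sample.doc</uri>
    <format>binary</format>
  </options>)
```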
HTH!
.doc and .xls are binary files which cannot be processed by XQuery processors directly.
Use OOXML like you mentioned. Save the files as .docx or .xlsx, which are zipped XML files (with some additional resources, such as images, inside the zip). Maybe the MarkLogic zip module can help you extract the files.
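As a rough illustration of the zip approach, the sketch below uses MarkLogic's built-in xdmp:zip-manifest and xdmp:zip-get functions to pull the XML parts out of a .docx that has already been loaded as a binary document. The URI /docs/sample.docx is an assumption for illustration.

```xquery
xquery version "1.0-ml";

(: xdmp:zip-manifest returns its part list in this namespace :)
declare namespace zip = "xdmp:zip";

(: ASSUMPTION: a .docx previously loaded as binary at this URI :)
let $docx := fn:doc("/docs/sample.docx")/binary()

(: List the parts in the archive and return each XML part as a node :)
for $part in xdmp:zip-manifest($docx)/zip:part
where fn:ends-with($part, ".xml")
return xdmp:zip-get($docx, $part)
```

Each returned part could then be stored with xdmp:document-insert under a URI of your choosing, so it can be indexed and queried like any other XML document.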
This can also be done using MS Office 2003 with the File Format Compatibility Pack installed. I'm sorry I cannot help you with batch conversion, but I'm sure there is some way to do this using VBA; ask another question if needed.