
How to load & query Word/Excel documents in MarkLogic Server?

I want to load MS Office Word/Excel documents into MarkLogic and query them with XQuery, as I do with XML documents. But when I load .doc files, MarkLogic stores them in binary format, and Query Console shows junk characters when I view them. I tried loading with the following command:

xdmp:document-load("E:\doc\sample.doc", 
    <options xmlns="xdmp:document-load"
             xmlns:http="xdmp:http">
      <format>xml</format>
    </options>)

But it raises an error saying the document is not UTF-8 encoded. Can .doc and .xls files be loaded as they are into MarkLogic, or do they have to be converted to XML or a UTF-8 encoded format first? If they must be converted, what is the conversion process? If not, how can we query them with XQuery? I also want to know whether an MS Office 2007/2010 installation is necessary for the conversion, because both Office 2007 and 2010 support the OOXML format.

Please give me proper guidance about this.

Grtjn's reply is correct if you're dealing with Office documents in a format prior to 2007/2010. For 2007/2010 documents, enable the "Office OpenXML Extract" pipeline in CPF and reload the documents. This pipeline does not require the additional conversion option. It will load the source XML as-is.

Office 2007/2010 docs are just .zip files containing interrelated XML parts. This pipeline will unzip any .docx, .xlsx, or .pptx document and save its component parts in a directory named after the source document. The directory is saved as a sibling of the source document and is linked to it. For example, if you delete the source .docx, the directory containing the extracted parts is deleted as well.
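
Once the parts have been extracted, they are ordinary XML documents and can be searched like any other XML. A minimal sketch, assuming the extracted parts include the WordprocessingML body of a .docx and that the search word "invoice" is just a placeholder:

```xquery
xquery version "1.0-ml";

(: Standard WordprocessingML namespace used by the body part of a .docx :)
declare namespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

(: Find paragraphs containing the word; "invoice" is a hypothetical search term :)
for $p in cts:search(fn:collection()//w:p, cts:word-query("invoice"))
return fn:concat(xdmp:node-uri($p), ": ", fn:string($p))
```

Each result pairs the URI of the part that matched with the plain text of the paragraph, which also shows where the pipeline stored the extracted parts.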

Make sure automatic directory creation is set to true for the database. (This is the default setting for MarkLogic 5.0 and prior versions).

They are binary, so they should be inserted as binary. But you want them to be converted, and MarkLogic can do that for you automatically. To do so:

  • Open the Admin interface
  • Go to the appropriate database
  • Open the Content Processing page
  • Open the Install tab, set the 'enable conversion' toggle to 'true', and hit install
  • Check the scope of the domain to make sure you are inserting within that scope, e.g. insert documents at a database URI that starts with the scope URI. (This most likely means you need to add a uri option to xdmp:document-load that starts with /.)
  • Check the pipelines to see which types of content are being converted automatically, and to which format (most typically XHTML or DocBook)
  • Rerun the xdmp:document-load
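
As a rough sketch of the last step, the load from the question could be rerun with a binary format and an explicit URI. The URI /docs/sample.doc is only an assumed example; it has to fall inside whatever scope your CPF domain actually uses:

```xquery
xquery version "1.0-ml";

(: Load the .doc as binary so that CPF can pick it up for conversion.
   The file path is taken from the question; the URI is an assumed example
   and must lie within the scope of the CPF domain. :)
xdmp:document-load("E:\doc\sample.doc",
  <options xmlns="xdmp:document-load">
    <uri>/docs/sample.doc</uri>
    <format>binary</format>
  </options>)
```

Loading with format binary avoids the UTF-8 error, because MarkLogic no longer tries to parse the file as XML.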

The Content Processing Framework will create additional files containing the conversion results. These usually consist of an XHTML document with the text, separate image files if there are any, CSS with layout properties, etc.

This does require a license with the conversion option.

HTH!

OOXML

.doc and .xls are binary files which cannot be processed by XQuery processors directly.

Use OOXML like you mentioned. Save the files as .docx or .xlsx, which are zipped XML files (with some more resources, like images, in the zip folders). Maybe the MarkLogic zip functions can help you extract the files.
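
For illustration, here is a small sketch using MarkLogic's built-in zip functions. It assumes the .docx has already been loaded as a binary document at the hypothetical URI /docs/sample.docx, and that it contains the usual word/document.xml body part:

```xquery
xquery version "1.0-ml";
declare namespace zip = "xdmp:zip";

(: The URI is hypothetical; the document must have been loaded as binary :)
let $docx := fn:doc("/docs/sample.docx")
return (
  (: List the names of all parts inside the package :)
  xdmp:zip-manifest($docx)/zip:part/fn:string(.),
  (: Pull out the main body part as XML :)
  xdmp:zip-get($docx, "word/document.xml")
)
```

The extracted part can then be inserted with xdmp:document-insert under a URI of your choice and queried like any other XML document.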

Using MS Office 2003

This can also be done using MS Office 2003 with the File Format Compatibility Pack installed. I'm sorry I cannot help you with batch conversion, but there is surely some way to do this using VBA. Ask another question if needed.
