
Search XML document with the largest size in a MarkLogic database

I want to search for the largest XML file in a MarkLogic database from the MarkLogic query console using XQuery. I can retrieve the size (bytes) of a document in the database using the following XQuery:

xdmp:binary-size(xdmp:unquote(xdmp:quote($doc),(),"format-binary")/binary())

where $doc is the document whose size in bytes I want.

It is true that there is no index on document size to quickly find the largest ones. But there are some options to find large documents.

One is to run a batch job that scans for large documents, using the function above to compute each size. It is a little simpler to use the serialized length instead: string-length(xdmp:quote(doc($uri))) in XQuery, or xdmp.quote(cts.doc("/my/uri/here")).length in JavaScript. Keep in mind that this measures length in characters rather than bytes.

CoRB, NiFi, or functions spawned on the task server via xdmp.spawnFunction() can run a big job like that over a period of time: check each document's size, and store a record or log an indicator when it exceeds some size limit. You would then search or grep those records for the largest size.
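As a rough illustration of the scan step, here is a minimal XQuery sketch that logs every document whose serialized length exceeds a limit. The $limit value and the "large-doc" log prefix are assumptions made up for this example, and it runs as one pass in Query Console rather than as a real CoRB or task-server job, so treat it as a starting point only.

```xquery
xquery version "1.0-ml";

(: Hypothetical threshold: flag documents longer than 1,000,000 characters when serialized. :)
declare variable $limit as xs:integer := 1000000;

(: cts:uri-match requires the URI lexicon to be enabled on the database. :)
for $uri in cts:uri-match("*.xml")
(: Serialized length in characters, as suggested above. :)
let $size := fn:string-length(xdmp:quote(fn:doc($uri)))
where $size gt $limit
(: "large-doc" is an arbitrary marker to grep for in the error log. :)
return xdmp:log(fn:concat("large-doc ", $uri, " ", $size))
```

On a large database a single request like this may hit the request time limit, which is why splitting the URIs across many small tasks is the usual approach.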

Sometimes, if you know the structure and some common terms that will be in a larger document, you can search for documents that contain a common "word" or "term" many times using cts.wordQuery("theCommonTerm") and the option "min-occurs=number". You need to adjust the min-occurs number to narrow things down to the largest documents, then run your size query just on those.
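The narrowing approach might look like the following sketch in XQuery. The term "theCommonTerm" is the placeholder from the paragraph above, and the min-occurs value of 100 is an arbitrary assumption that you would tune for your own data.

```xquery
xquery version "1.0-ml";

(: Only documents containing the term at least 100 times are candidates. :)
for $doc in cts:search(
  fn:doc(),
  cts:word-query("theCommonTerm", "min-occurs=100")
)
(: Measure only the candidates, not the whole database. :)
let $size := fn:string-length(xdmp:quote($doc))
order by $size descending
return fn:concat(xdmp:node-uri($doc), " ", $size)
```

If this returns too many or too few documents, raise or lower the min-occurs value and run it again.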

I found the following query useful:

(
for $doc in cts:uri-match('*.xml')
order by string-length(fn:doc($doc)) descending
return $doc
)[position() = 1]

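The above query uses the string-length function to count characters. Note that string-length applied to a document node measures its string value, that is, the text content without the markup; wrap the document in xdmp:quote() if you want the length of the serialized XML. Counting characters rather than bytes is useful when the documents contain special (multi-byte) characters. Be aware that the query reads and sorts every matching document, so it can be slow on a large database, and that cts:uri-match requires the URI lexicon to be enabled.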

If you want the number of bytes you can use xdmp:binary-size as follows:

(
for $doc in cts:uri-match('*.xml')
order by xdmp:binary-size(xdmp:unquote(xdmp:quote(fn:doc($doc)),(),"format-binary")/binary()) descending
return $doc
)[position() = 1]

