How to filter out non-JSON documents in MarkLogic?

I have a lot of data loaded in my database, and some of the loaded documents are not JSON files but binary files. Correct data has URIs like "/foo/bar/1.json", but the incorrect data has URIs of the form "/foo/bar/*". Is there a mechanism in MarkLogic, using JavaScript, to filter out this junk data and delete it? PS: I'm unable to extract files with mlcp that have a "?" in the URI, and when I try to reload this data I get an error. Is there any way to fix that extract as well?

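If all of the junk document URIs contain a ? and are in that directory, then you could use cts.uriMatch(). Note that in MarkLogic wildcard patterns ? normally matches any single character rather than a literal question mark, so check which URIs the pattern returns before deleting anything.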

declareUpdate();
for (const uri of cts.uriMatch('/foo/bar/*?*') ) {
  xdmp.documentDelete(uri);
}

Alternatively, if you are looking to find the binary() documents, you can apply the format-binary option to a cts.search() with a cts.directoryQuery() and then delete them.

declareUpdate();
for (const doc of cts.search(cts.directoryQuery("/foo/bar/"), ['format-binary']) ) {
  xdmp.documentDelete(fn.baseUri(doc));
}

They are probably being persisted as binary because the URI has no recognizable file extension when it ends with a question mark and some query string parameter values, i.e. "1.json?foo=bar" instead of "1.json".
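
To illustrate the point about the extension, here is a minimal plain-JavaScript sketch. The helper names uriExtension and stripQuery are hypothetical and are not MarkLogic APIs, and the logic is only an approximation of how an extension might be read from a URI, not MarkLogic's actual format resolution.

```javascript
// Hypothetical helper (not a MarkLogic API): returns the text after the last
// "." in the final path segment of a URI, or "" if there is no ".".
function uriExtension(uri) {
  const lastSegment = uri.slice(uri.lastIndexOf('/') + 1);
  const dot = lastSegment.lastIndexOf('.');
  return dot === -1 ? '' : lastSegment.slice(dot + 1);
}

// Hypothetical clean-up step: drops everything from the first "?" onward.
function stripQuery(uri) {
  const q = uri.indexOf('?');
  return q === -1 ? uri : uri.slice(0, q);
}

// A clean URI yields "json"; with a query string attached, the apparent
// extension is no longer "json", so it would not be recognized as JSON.
console.log(uriExtension('/foo/bar/1.json'));
console.log(uriExtension('/foo/bar/1.json?foo=bar'));
console.log(uriExtension(stripQuery('/foo/bar/1.json?foo=bar')));
```

If that is what is happening, stripping the query string from the URI before loading (however that fits into your load process) should let the documents be stored as JSON.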

It is difficult to diagnose and troubleshoot without seeing what your MLCP job configs are and knowing more about what you are doing to load the data.
