
Splitting a document using MarkLogic mlcp

I need to split this document into one document per stwtext element:

 <?xml version="1.0"?>
 <!DOCTYPE docs SYSTEM "../rom11.dtd">
 <docs>
   <stwtext id="RD-10-00258" update="03.2011" seq="RQ-10-00001">
     <head>
       <ti> <i>j</i> </ti>
       <ff-list> <ff id="0103" /> </ff-list>
     </head>
     <p> Symbol f&#x00FC;r die <vw idref="RD-19-04447">Stromdichte</vw> . </p>
   </stwtext>
   <stwtext id="RD-10-00209" update="12.2007" seq="RQ-10-00223">
     <head>
       <ti>JZ</ti>
       <ff-list> <ff id="0932" /> </ff-list>
     </head>
     <p> Abk&#x00FC;rzung f&#x00FC;r Jod-Zahl, siehe <vw idref="RD-06-00645">Fettkennzahlen</vw> . </p>
   </stwtext>
 </docs>

I do it with this command:

~> bin/mlcp.sh IMPORT -mode local -host localhost -port 15000 \
  -username admin -password admin \
  -input_file_path /media/sf_vm.shared/theme/rom-training/v10.new-ML.XML \
  -output_uri_replace "/media/sf_vm.shared/theme/rom-training/keywords,'rom-data'" \
  -output_collections rom-data \
  -input_file_type aggregates -aggregate_record_element stwtext \
  -aggregate_uri_id @id

The command runs without errors, but the documents created in MarkLogic are not named after the id of each stwtext element. Instead, each one is named after the id of the last element inside the record that carries an id attribute (here, ff). For the sample document above, I expect to see

RD-10-00258
RD-10-00209

but actually it looks like this:

0103
0932

Is this a bug, or am I doing something wrong? Thanks.

It's a bug. If you'd like, you can download the MLCP source code and fix it yourself. Take a look at processStartElement() in AggregateXMLReader.java.
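To make the symptom concrete, here is a minimal Python sketch of how this kind of bug can arise in a streaming parser. It is not MLCP's actual code: the function record_ids and its only_record_element flag are hypothetical, and the "last id attribute seen wins" mechanism is an assumption that merely matches the output reported in the question. The sample is a trimmed copy of the document above.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question.
SAMPLE = """<docs>
  <stwtext id="RD-10-00258">
    <head><ti><i>j</i></ti><ff-list><ff id="0103"/></ff-list></head>
  </stwtext>
  <stwtext id="RD-10-00209">
    <head><ti>JZ</ti><ff-list><ff id="0932"/></ff-list></head>
  </stwtext>
</docs>"""


def record_ids(xml_text, record="stwtext", attr="id", only_record_element=True):
    """Return one id per record element, read from a stream of parse events.

    With only_record_element=False the id is overwritten by any descendant
    that has the same attribute, which is the assumed cause of the bug.
    """
    ids = []
    current = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            is_record = elem.tag == record
            if is_record:
                current = None  # a new record begins, forget the old id
            if attr in elem.attrib and (is_record or not only_record_element):
                current = elem.attrib[attr]
        elif elem.tag == record:
            ids.append(current)  # record closed, emit whatever id we hold
    return ids


print(record_ids(SAMPLE))                             # ids taken from stwtext only
print(record_ids(SAMPLE, only_record_element=False))  # ids overwritten by ff
```

Under that assumption, the fix amounts to reading the attribute named by -aggregate_uri_id only when the start element is the record element itself, and ignoring same-named attributes on its descendants.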
