
Best way to extract big xml block from large xml file

I am extracting big blocks from XML files using XPath. My XML files are large; they come from PubMed. An example of my file type is:

ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0001.xml.gz

So, by using

 Node result = (Node)xPath.evaluate("PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = "+PMIDtoSearch+"]", doc, XPathConstants.NODE);

I get the article with PMIDtoSearch, so it works perfectly. But it takes a long time. I have to do it around 800,000 times, so with this solution it would take more than two months. Some blocks have more than 400 lines, and each XML file has more than 4 million lines.

I have also tried a solution based on the getElementsByTagName function, but it takes almost the same time.

Do you know how to improve the solution?

Thanks.

I took your document, loaded it into eXist-db, and then executed your query, essentially this:

xquery version "3.0";
let $medline := '/db/Medline/Data'
let $doc := 'medline17n0001.xml'
let $PMID := request:get-parameter("PMID", "")
let $article := doc(concat($medline,'/',$doc))/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID=$PMID]
return
$article

The document is returned in 400 milliseconds from a remote server. If I beefed up that server, I would expect less than that, and it could handle multiple concurrent requests. If you had everything local, it would be even faster.

Try it yourself. I left the data on a test server (and remember this is querying a remote Amazon micro server in California):

http://54.241.15.166/get-article2.xq?PMID=8

http://54.241.15.166/get-article2.xq?PMID=6

http://54.241.15.166/get-article2.xq?PMID=1

And of course, that entire document is there. You can just change that query to PMID=667 or 999 or whatever and get the target document fragment back.

As @KevinBrown suggests, a database might well be the right answer. But if this is a one-time process there are probably solutions that work a lot faster than yours but don't require the complexity of learning how to set up an XML database.

In the approach you are using, there are two main costs: parsing the XML documents to create a tree in memory, and then searching the in-memory document to find a particular ID value. I would guess that the parsing cost is probably an order of magnitude greater than the searching cost.

So there are two ingredients to getting good performance for this:

  • first, you need to make sure you are only parsing each source document once (rather than once per query). You haven't told us enough for me to be able to tell whether you are already doing this.

  • second, if you are retrieving many chunks of data from a single document, you want to do this without doing a serial search for each one. The best way to achieve this is to use a query processor that builds an index to optimize the query (such as Saxon-EE). Alternatively, you could build indexes "by hand", for example by using XQuery 3.1 maps, or by using the xsl:key feature in XSLT.
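As a rough illustration of the two points above, here is a minimal Java sketch of a hand-built index using only the JDK's DOM API (not Saxon, XQuery maps, or xsl:key). It assumes the document fits in memory, that PubmedArticle elements are direct children of the root, and that matching the PMID as trimmed text is acceptable (the original XPath compares numerically). The class name PmidIndex, its method names, and the tiny inline sample are made up for the example.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: parse once, index once, then look up by PMID in O(1).
public class PmidIndex {

    /** Parses XML text into a DOM tree. For real files, parse each file once and reuse the Document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Walks the root's PubmedArticle children a single time and maps PMID text to the article element. */
    static Map<String, Element> buildIndex(Document doc) {
        Map<String, Element> index = new HashMap<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "PubmedArticle".equals(n.getNodeName())) {
                Element article = (Element) n;
                String pmid = pmidOf(article);
                if (pmid != null) {
                    index.putIfAbsent(pmid, article); // keep the first article seen for a PMID
                }
            }
        }
        return index;
    }

    /** Returns the trimmed text of MedlineCitation/PMID for one article, or null if it is absent. */
    static String pmidOf(Element article) {
        for (Node c = article.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element && "MedlineCitation".equals(c.getNodeName())) {
                for (Node p = c.getFirstChild(); p != null; p = p.getNextSibling()) {
                    if (p instanceof Element && "PMID".equals(p.getNodeName())) {
                        return p.getTextContent().trim();
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up sample standing in for a real PubMed baseline file.
        String sample = "<PubmedArticleSet>"
                + "<PubmedArticle><MedlineCitation><PMID Version=\"1\">6</PMID></MedlineCitation></PubmedArticle>"
                + "<PubmedArticle><MedlineCitation><PMID Version=\"1\">8</PMID></MedlineCitation></PubmedArticle>"
                + "</PubmedArticleSet>";
        Map<String, Element> index = buildIndex(parse(sample)); // one parse, one pass
        for (String pmid : new String[] {"8", "6", "999"}) {    // many cheap lookups
            System.out.println(pmid + " found: " + index.containsKey(pmid));
        }
    }
}
```

With this shape, the 800,000 lookups become hash-map reads, and the expensive part (parsing plus one pass to build the map) happens once per file rather than once per query.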

This is the code that runs the XPath query using VTD-XML. On my laptop the results look decent: each query took under a second regardless of the PMID value. How do you intend to extract the text? I can update the code to target that.

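import com.ximpleware.AutoPilot;
import com.ximpleware.VTDException;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class FindArticle {
    public static void main(String[] args) throws VTDException {
        VTDGen vg = new VTDGen();
        // Parse the file once; the second argument (false) disables namespace awareness.
        if (!vg.parseFile("d:\\xml\\medline17n0001.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        System.out.println("nesting level" + vn.getNestingLevel());
        String PMIDtoSearch = "30000";
        ap.selectXPath("/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = " + PMIDtoSearch + "]");
        System.out.println("====>" + ap.getExprString());
        int i = 0, count = 0;
        System.out.println(" token count ====> " + vn.getTokenCount());
        // evalXPath() returns the token index of each match, or -1 when there are no more.
        while ((i = ap.evalXPath()) != -1) {
            count++;
            // Prints the token at index i, which is the matched element's name.
            System.out.println("string ====>" + vn.toString(i));
        }
        System.out.println(" count ===> " + count);
    }
}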
