
Best way to extract big XML block from large XML file

I am extracting big blocks from XML files by using XPath. My XML files are large; they come from PubMed. An example of my file type is:

ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/medline17n0001.xml.gz

So, by using

 Node result = (Node)xPath.evaluate("PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = "+PMIDtoSearch+"]", doc, XPathConstants.NODE);

I get the article with PMIDtoSearch, so it works perfectly. But it takes a long time. I have to do it around 800,000 times, so with this solution it would take more than two months. Some blocks have more than 400 lines, and each XML file has more than 4 million lines.

I have also tried a solution using the getElementsByTagName function, but it takes almost the same time.

Do you know how to improve the solution?

Thanks.

I took your document, loaded it into eXist-db, and then executed your query, essentially this:

xquery version "3.0";
let $medline := '/db/Medline/Data'
let $doc := 'medline17n0001.xml'
let $PMID := request:get-parameter("PMID", "")
let $article := doc(concat($medline,'/',$doc))/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID=$PMID]
return
$article

The document is returned in 400 milliseconds from a remote server. If I beefed up that server, I would expect less than that, and it could handle multiple concurrent requests. If you had everything local, it would be even faster.

Try it yourself; I left the data on a test server (and remember this is querying a remote Amazon micro server in California):

http://54.241.15.166/get-article2.xq?PMID=8

http://54.241.15.166/get-article2.xq?PMID=6

http://54.241.15.166/get-article2.xq?PMID=1

And of course, the entire document is there. You can just change that query to PMID=667 or 999 or whatever and get the target document fragment back.
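
If you want to drive those lookups from Java rather than a browser, a minimal sketch with the JDK's built-in HTTP client could look like this (the `ArticleFetcher` class and `buildUrl` helper are illustrative names, and the test server above may no longer be reachable by the time you try it):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ArticleFetcher {
    // Builds a query URL in the same shape as the test links above.
    static String buildUrl(String base, int pmid) {
        return base + "/get-article2.xq?PMID=" + pmid;
    }

    public static void main(String[] args) {
        String url = buildUrl("http://54.241.15.166", 8);
        try {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The response body is the matching PubmedArticle fragment.
            System.out.println(response.body());
        } catch (Exception e) {
            System.out.println("Server unreachable: " + url);
        }
    }
}
```

Looping over your 800,000 PMIDs then becomes a matter of calling `buildUrl` per ID, and the database does the indexed lookup on its side.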

As @KevinBrown suggests, a database might well be the right answer. But if this is a one-time process, there are probably solutions that work a lot faster than yours without the complexity of learning how to set up an XML database.

In the approach you are using, there are two main costs: parsing the XML documents to create a tree in memory, and then searching the in-memory document to find a particular ID value. I would guess that the parsing cost is probably an order of magnitude greater than the searching cost.

So there are two ingredients to getting good performance for this:

  • First, you need to make sure you are only parsing each source document once (rather than once per query). You haven't told us enough for me to be able to tell whether you are already doing this.

  • Second, if you are retrieving many chunks of data from a single document, you want to do this without doing a serial search for each one. The best way to achieve this is to use a query processor that builds an index to optimize the query (such as Saxon-EE). Alternatively, you could build indexes "by hand", for example by using XQuery 3.1 maps, or by using the xsl:key feature in XSLT.
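
The same "parse once, index by hand" idea can also be done directly in Java with the JDK's DOM API: parse the file a single time, then build a map from PMID to its PubmedArticle element so each of the 800,000 lookups is a hash probe instead of a serial scan. A minimal sketch (the `PmidIndex` class name and the toy inline XML are illustrative, not from the question):

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PmidIndex {

    // Walk every PubmedArticle once and map its first PMID in document order
    // (the MedlineCitation/PMID) to the element, so later lookups are O(1).
    static Map<String, Element> buildIndex(Document doc) {
        Map<String, Element> index = new HashMap<>();
        NodeList articles = doc.getElementsByTagName("PubmedArticle");
        for (int i = 0; i < articles.getLength(); i++) {
            Element article = (Element) articles.item(i);
            String pmid = article.getElementsByTagName("PMID")
                                 .item(0).getTextContent();
            index.put(pmid, article);
        }
        return index;
    }

    // Convenience for the demo: parse XML from a string and index it.
    // For the real files you would parse the file once instead.
    static Map<String, Element> indexFromString(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return buildIndex(doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PubmedArticleSet>"
                + "<PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>"
                + "<PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>"
                + "</PubmedArticleSet>";
        Map<String, Element> index = indexFromString(xml);
        System.out.println(index.get("2").getTagName()); // prints "PubmedArticle"
    }
}
```

The one-time cost is a single parse plus one pass over the articles; every query after that avoids re-parsing and re-searching entirely.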

This is the code that does the XPath querying. On my laptop, the results look decent: it took under a second regardless of the PMID value. How do you intend to extract the text? I can update the code to target that.

import com.ximpleware.*; // VTD-XML (com.ximpleware:vtd-xml)

public static void main(String[] args) throws VTDException {
    VTDGen vg = new VTDGen();
    if (!vg.parseFile("d:\\xml\\medline17n0001.xml", false))
        return; // parse failed
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    System.out.println("nesting level " + vn.getNestingLevel());
    String PMIDtoSearch = "30000";
    ap.selectXPath("/PubmedArticleSet/PubmedArticle[MedlineCitation/PMID = " + PMIDtoSearch + "]");
    System.out.println("====> " + ap.getExprString());
    int i = 0, count = 0;
    System.out.println("token count ====> " + vn.getTokenCount());
    while ((i = ap.evalXPath()) != -1) {
        count++;
        System.out.println("string ====> " + vn.toString(i));
    }
    System.out.println("count ===> " + count);
}

