How can I improve the performance of querying XML files with VTD-XML and XPath?

Question

I am querying XML files of around 1 MB (20k+ lines). I use XPath to describe what I want and the VTD-XML library to retrieve it. I think I have a performance problem.

The problem is that I make about 5k+ queries against each XML file, and it takes approximately 16-17 seconds to retrieve all the values. Is this normal performance for such a task? How can I improve it?

I am using the VTD-XML library with the AutoPilot navigation approach, which lets me use XPath. The implementation is as follows:

private VTDGen vg = new VTDGen();
private VTDNav vn;
private AutoPilot ap = new AutoPilot();

public void init(String xml) {
    log.info("Creating document");
    xml = xml.replace("<?xml version=\"1.0\"?>", "<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
    vg.setDoc(bytes);
    try {
        vg.parse(true);
        vn = vg.getNav();
    } catch (ParseException e) {
        e.printStackTrace();
    }
    log.info("Document created");
}

public String parseXmlOrReturnNull(String query) {
    String xPathStringVal = null;
    try {
        ap.selectXPath(query);
        ap.bind(vn);
        int i = -1;
        while ((i = ap.evalXPath()) != -1) {
            xPathStringVal = vn.getXPathStringVal();
        }
    } catch (XPathEvalException e) {
        e.printStackTrace();
    } catch (NavException e) {
        e.printStackTrace();
    } catch (XPathParseException e) {
        e.printStackTrace();
    }
    return xPathStringVal;
}

My XML files have a specific format: they are divided into many parts (segments), and my queries are the same for every segment (I run them in a loop). For example, part of the XML:

<segment>
    <a>
        <b>value1</b>
        <c>
            <d>value2</d>
            <e>value3</e>
        </c>
    </a>
</segment>
<segment>
    <a>
        <b>value4</b>
        <c>
            <d>value5</d>
            <e>value6</e>
            <f>value6</f>
        </c>
    </a>
</segment>
...

To get value1 in the first segment I use the query:

//segment[1]/a/b

for value4 in the second segment:

//segment[2]/a/b

etc.

Intuition tells me a few things: in my approach every query is independent (it knows nothing about the other queries), which means that AutoPilot, my iterator, always starts at the beginning of the file when I query it.

My question is: is there any way to position AutoPilot at the beginning of the segment being processed, and to move it to the next segment when I finish querying? I think that if my method started searching from a specified point rather than from the beginning, it would be much faster.

Another way is to split the XML file into small XML files (one file per segment) and query those.

What do you think? Thanks in advance.

Answer 1

Minor: the replace is not needed, as UTF-8 is the default encoding; only when a different encoding is declared would one need to patch it to UTF-8.

The XPath should be evaluated only once, so that each query does not start again from the first segment and scan forward to the requested index.
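The idea of evaluating once can be sketched as follows: select all segments in a single pass, then evaluate a short relative path against each segment instead of an absolute `//segment[n]/...` query. This is only an illustration, and it swaps in the JDK's own `javax.xml.xpath` API rather than VTD-XML so that it runs without extra libraries; the class name `SegmentQueries`, the method `readB`, and the `<root>` wrapper element are assumptions made for the example. In VTD-XML the same shape would presumably be an outer AutoPilot selecting the segments and a second AutoPilot holding the relative expression, evaluated between `vn.push()` and `vn.pop()`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SegmentQueries {

    /** Returns the text of a/b for every segment, in document order. */
    public static List<String> readB(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Both expressions are compiled once and reused for every segment.
        XPathExpression allSegments = xpath.compile("//segment");
        XPathExpression bInSegment = xpath.compile("a/b"); // relative to a segment

        // A single pass over the document locates all segments.
        NodeList segments = (NodeList) allSegments.evaluate(doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < segments.getLength(); i++) {
            // The segment is the context node, so the path is resolved inside it only.
            values.add(bInSegment.evaluate(segments.item(i)));
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<segment><a><b>value1</b></a></segment>"
                + "<segment><a><b>value4</b></a></segment>"
                + "</root>";
        System.out.println(readB(xml)); // prints [value1, value4]
    }
}
```

With this layout the cost of finding the n-th segment is paid once for the whole file, not once per query.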

If you need a List representation, you could use JAXB with annotations.

Event-based, primitive parsing without a DOM object (SAXParser) is probably best.

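import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startElement(String uri,
        String localName, String qName, Attributes attributes) throws SAXException {
        // Called for every opening tag, e.g. <segment>.
    }

    @Override
    public void endElement(String uri,
        String localName, String qName) throws SAXException {
        // Called for every closing tag; the collected text is complete here.
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Called with text content; may fire several times for one text node.
    }
};
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
// bytes: the UTF-8 encoded XML document, as built in init() in the question.
InputStream in = new ByteArrayInputStream(bytes);
parser.parse(in, handler);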
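To show how the empty callbacks above might be filled in, here is a minimal sketch that collects the leaf values of each segment in one pass. The class name `SegmentSaxReader`, the method `read`, and the choice of a map keyed by element name are assumptions for illustration; it expects leaf names to be unique within a segment, as in the sample XML, and does not handle mixed content.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SegmentSaxReader {

    /** Returns one map per segment: leaf element name -> text content. */
    public static List<Map<String, String>> read(String xml) throws Exception {
        List<Map<String, String>> segments = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private Map<String, String> current; // null while outside <segment>
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("segment".equals(qName)) {
                    current = new LinkedHashMap<>();
                }
                text.setLength(0); // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so append.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("segment".equals(qName)) {
                    segments.add(current);
                    current = null;
                } else if (current != null) {
                    String value = text.toString().trim();
                    if (!value.isEmpty()) { // skip container elements such as <a> and <c>
                        current.put(qName, value);
                    }
                }
                text.setLength(0);
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return segments;
    }
}
```

After `read(xml)` returns, a lookup such as `segments.get(0).get("b")` replaces the query `//segment[1]/a/b` without touching the document again.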

Solution 1 (0, accepted, 2019-04-15 14:20:33)
