
Issue with Java StAX parser when reading a large file

I am trying to read an XML file of nearly 180k lines using the StAX parser. The core logic looks for certain tags and attributes and stores them in a data structure. For a file this large, the StAX parser takes a long time: nearly 15 minutes without any core logic, just iterating over the while loop.

while (eventReader.hasNext()) { }

I tried a SAX parser on the same file just to read the tags. It is very fast and completes in a couple of seconds.

What could be the issue with the StAX parser? Please suggest an XML parser that is suitable for large files and performs well with respect to memory and space utilization.

Calling hasNext() will always return true unless you have reached the end of the input, and your code doesn't change position in the input because it never reads any data. You need to call next() in the loop; then hasNext() will eventually return false.
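To make the fix concrete, here is a minimal sketch of an event loop that does advance, assuming the question's eventReader is an XMLEventReader. The class name EventLoopSketch, the method countStartElements, and the inline sample XML are illustrative assumptions, not part of the original code.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, used only for this illustration.
final class EventLoopSketch {

    // Counts start tags; the counting task is a stand-in for the real core logic.
    static int countStartElements(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader eventReader = factory.createXMLEventReader(new StringReader(xml));
        int count = 0;
        try {
            while (eventReader.hasNext()) {
                // nextEvent() consumes one event, so the loop makes progress
                // and hasNext() eventually returns false.
                XMLEvent event = eventReader.nextEvent();
                if (event.isStartElement()) {
                    count++;
                }
            }
        } finally {
            eventReader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Assumed sample input: one root element with two children.
        System.out.println(countStartElements("<posts><post>a</post><post>b</post></posts>"));
    }
}
```

Without the nextEvent() call the loop body never consumes anything, which is why the original loop appeared to run indefinitely rather than slowly.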

Incidentally, 180k lines is not a large file by modern standards.

Stick with the StAX parser: both SAX and StAX follow a streaming programming model for parsing XML. I ran sample code for both SAX and StAX; here are the results:

SAX parser: total time taken: 10.73 ms, max memory: 1842688, allocated memory: 125952, free memory: 107293

StAX parser: total time taken: 7.5 ms, max memory: 1842688, allocated memory: 125952, free memory: 120611

StAX is a pull API, whereas SAX is a push API. With StAX, the client application calls methods on the XML parsing library when it needs to interact with the XML infoset; that is, the client only gets (pulls) XML data when it explicitly asks for it. With SAX, the parser sends (pushes) XML data to the client as it encounters elements in the XML infoset; that is, the parser sends the data whether or not the client is ready to use it at that time. The StAX API can read as well as write XML documents, whereas the SAX API can only read them.
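The push/pull distinction may be easier to see side by side. The sketch below collects element names once through a SAX callback and once through a StAX loop, and also writes a tiny document with XMLStreamWriter. The class PushVsPullSketch, its method names, and the sample strings are assumptions made up for this illustration; only the JDK's javax.xml and org.xml.sax APIs are used.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this illustration.
final class PushVsPullSketch {

    // Push (SAX): the parser drives and invokes the callback for every start tag.
    static List<String> namesWithSax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return names;
    }

    // Pull (StAX): the client drives and asks for the next event when it wants one.
    static List<String> namesWithStax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    // StAX can also write XML; SAX has no counterpart for this.
    static String writeWithStax(String elementName, String text) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartElement(elementName);
        writer.writeCharacters(text);
        writer.writeEndElement();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<posts><post>a</post><post>b</post></posts>"; // assumed sample input
        System.out.println(namesWithSax(xml));
        System.out.println(namesWithStax(xml));
        System.out.println(writeWithStax("post", "hello"));
    }
}
```

Because the StAX loop is ordinary client code, it can stop early or skip ahead with a plain break, whereas a SAX handler keeps receiving callbacks until the parser finishes or an exception is thrown.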

StAX code:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// The enclosing class name is a placeholder; the original snippet showed only main().
public class StaxExample {
    public static void main(String[] args) throws IOException, XMLStreamException {
        XMLInputFactory xf = XMLInputFactory.newInstance();
        // Pass the raw stream so the parser detects the encoding from the XML declaration
        // instead of relying on the platform default charset.
        try (InputStream in = new FileInputStream("C:\\Users\\RNayyar\\Desktop\\Context\\processedFiles\\post.xml")) {
            XMLStreamReader xsr = xf.createXMLStreamReader(in);
            String startElement = null;
            String elementTxt = null;

            while (xsr.hasNext()) {
                int e = xsr.next(); // advances the reader to the next event
                if (e == XMLStreamConstants.START_ELEMENT) {
                    startElement = xsr.getLocalName();
                } else if (e == XMLStreamConstants.CHARACTERS) {
                    // Text containing a newline is treated as inter-element whitespace.
                    elementTxt = xsr.getText().contains("\n") ? "" : xsr.getText();
                } else if (e == XMLStreamConstants.END_ELEMENT) {
                    String endElement = xsr.getLocalName();
                    // Print only when the closing tag matches the most recent opening tag.
                    if (endElement.equalsIgnoreCase(startElement)) {
                        System.out.println(" ElementName : " + startElement + " ElementText : " + elementTxt);
                    }
                }
            }
            xsr.close();
        }
    }
}
