How do I track parsing progress for a large file in StAX?

Question

I'm processing large (1TB) XML files using the StAX API. Let's assume we have a loop handling some elements:

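XMLInputFactory fac = XMLInputFactory.newInstance();
XMLStreamReader reader = fac.createXMLStreamReader(new FileReader(inputFile));
// Loop until the end of the document; next() also tolerates text content,
// whereas nextTag() throws on non-whitespace text
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        // handle contents
    }
}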

How do I keep track of overall progress within the large XML file? Fetching the offset from the reader works fine for smaller files:

int offset = reader.getLocation().getCharacterOffset();

but since it is an int offset, it will probably only work for files up to 2GB...

Answer 1

A simple FilterReader should work.

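import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class ProgressCounter extends FilterReader {
    // Number of characters consumed so far; a long, so it is not capped at 2GB
    private long progress = 0;

    public ProgressCounter(Reader in) {
        super(in);
    }

    @Override
    public long skip(long n) throws IOException {
        // Count what was actually skipped, which may be less than n
        long skipped = super.skip(n);
        progress += skipped;
        return skipped;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int red = super.read(cbuf, off, len);
        // read returns -1 at end of stream; do not subtract it from the count
        if (red > 0) {
            progress += red;
        }
        return red;
    }

    @Override
    public int read() throws IOException {
        int red = super.read();
        // The return value is the character itself, so count one per call
        if (red != -1) {
            progress++;
        }
        return red;
    }

    public long getProgress() {
        return progress;
    }
}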

Answer 2

It seems that the StAX API can't give you a long offset.

As a workaround you could create a custom java.io.FilterReader class which overrides read() and read(char[] cbuf, int off, int len) to increment a long offset.

You would pass this reader to the XMLInputFactory. The handler loop can then get the offset information directly from the reader.
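As a rough illustration of that wiring, here is a minimal sketch. The names CountingReader and ReaderProgressDemo are made up for this example, and an in-memory string stands in for the real file:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts the characters handed to the parser in a long.
class CountingReader extends FilterReader {
    private long count = 0;

    CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) {
            count++; // one character consumed
        }
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) {
            count += n; // ignore the -1 end-of-stream marker
        }
        return n;
    }

    long getCount() {
        return count;
    }
}

class ReaderProgressDemo {
    // Parses the XML and returns {start elements seen, characters consumed}.
    static long[] parse(String xml) throws XMLStreamException {
        CountingReader counting = new CountingReader(new StringReader(xml));
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(counting);
        long elements = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                elements++;
                // counting.getCount() can be polled here to report progress
            }
        }
        reader.close();
        return new long[] { elements, counting.getCount() };
    }

    public static void main(String[] args) throws XMLStreamException {
        long[] result = parse("<root><a/><b>text</b></root>");
        System.out.println(result[0] + " elements, " + result[1] + " chars read");
    }
}
```

Note that the parser reads ahead in buffers, so the counter runs somewhat ahead of the element currently being handled. For a progress display on a very large file that difference should be negligible.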

You could also do this at the byte level using a FilterInputStream, counting the byte offset instead of the character offset. That would allow for an exact progress calculation given the file size.
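The byte-level variant might look something like the sketch below. CountingInputStream and ByteProgressDemo are hypothetical names, and a byte array stands in for the file; for a real file the total would come from something like Files.size(path) or File.length().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts the bytes pulled from the underlying stream.
class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) {
            bytesRead += skipped;
        }
        return skipped;
    }

    long getBytesRead() {
        return bytesRead;
    }
}

class ByteProgressDemo {
    // Progress as a percentage of the total size in bytes.
    static double percent(long done, long total) {
        return total == 0 ? 100.0 : 100.0 * done / total;
    }

    // Parses the document and returns the final progress percentage.
    static double parseAndReport(byte[] xml) throws XMLStreamException {
        long total = xml.length; // assumption: known up front, like a file size
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(xml));
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(counting);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.printf("%s: %.1f%%%n", reader.getLocalName(),
                        percent(counting.getBytesRead(), total));
            }
        }
        reader.close();
        return percent(counting.getBytesRead(), total);
    }
}
```

Passing an InputStream rather than a Reader also lets the parser detect the encoding from the XML declaration, which the FileReader in the question bypasses by using the platform default charset.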

Answer 1: score 3, accepted, 2016-01-11 15:09:18
Answer 2: score 1, 2016-01-11 15:06:52
