How do I keep track of parsing progress of large files in StAX?
I'm processing large (1TB) XML files using the StAX API. Let's assume we have a loop handling some elements:
XMLInputFactory fac = XMLInputFactory.newInstance();
XMLStreamReader reader = fac.createXMLStreamReader(new FileReader(inputFile));
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        // handle contents
    }
}
How do I keep track of overall progress within the large XML file? Fetching the offset from the reader works fine for smaller files:
int offset = reader.getLocation().getCharacterOffset();
but since it is an int offset, it will probably only work for files up to 2GB...
A simple FilterReader should work.
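import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class ProgressCounter extends FilterReader {
    private long progress = 0;

    public ProgressCounter(Reader in) {
        super(in);
    }

    @Override
    public long skip(long n) throws IOException {
        // Count what was actually skipped, which may be less than n.
        long skipped = super.skip(n);
        progress += skipped;
        return skipped;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read > 0) { // -1 signals end of stream and must not be added
            progress += read;
        }
        return read;
    }

    @Override
    public int read() throws IOException {
        // The return value is the character itself (or -1), so count one char, not its value.
        int c = super.read();
        if (c != -1) {
            progress++;
        }
        return c;
    }

    public long getProgress() {
        return progress;
    }
}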
It seems that the StAX API can't give you a long offset.
As a workaround you could create a custom java.io.FilterReader class which overrides read() and read(char[] cbuf, int off, int len) to increment a long offset.
You would pass this reader to the XMLInputFactory. The handler loop can then get the offset information directly from the reader.
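As a rough sketch of that wiring (not a definitive implementation), the example below wraps the input in a counting reader and hands it to the XMLInputFactory. The names ProgressDemo, CountingReader and countElements are made up for illustration, and the input is a small in-memory string so the example runs on its own; for a real file you would wrap a FileReader instead.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ProgressDemo {

    /** Hypothetical helper: counts the characters handed to the parser in a long. */
    static class CountingReader extends FilterReader {
        private long count = 0;

        CountingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c != -1) {
                count++; // one character was read
            }
            return c;
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int n = super.read(cbuf, off, len);
            if (n > 0) {
                count += n; // ignore the -1 end-of-stream marker
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(n);
            count += skipped;
            return skipped;
        }

        long getCount() {
            return count;
        }
    }

    /** Parses the whole document and returns the number of start elements seen. */
    static int countElements(CountingReader counter) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(counter);
        int elements = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                elements++;
                // The handler can query counter.getCount() here for the long offset.
            }
        }
        reader.close();
        return elements;
    }

    public static void main(String[] args) throws Exception {
        CountingReader counter = new CountingReader(new StringReader("<root><item/><item/></root>"));
        int elements = countElements(counter);
        System.out.println(elements + " elements, " + counter.getCount() + " chars read");
    }
}
```

Note that the parser reads its input in buffered chunks, so the counter usually runs somewhat ahead of the element currently being handled. For a progress display on a very large file that difference should be negligible.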
You could also do this at the byte level using a FilterInputStream, counting the byte offset instead of the character offset. That would allow for an exact progress calculation given the file size.
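A minimal sketch of the byte-level variant, under the same caveats: ByteProgressDemo, CountingInputStream and percent are hypothetical names, and a small byte array stands in for the file. With a real file you would wrap a FileInputStream and take the total from File.length().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ByteProgressDemo {

    /** Hypothetical helper: counts the bytes handed to the parser in a long. */
    static class CountingInputStream extends FilterInputStream {
        private long count = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n; // ignore the -1 end-of-stream marker
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(n);
            count += skipped;
            return skipped;
        }

        // Rewinding would make the count wrong, so mark/reset is switched off.
        @Override
        public boolean markSupported() {
            return false;
        }

        @Override
        public synchronized void mark(int readlimit) {
            // intentionally a no-op
        }

        @Override
        public synchronized void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }

        long getCount() {
            return count;
        }
    }

    /** Progress in percent; returns 0 when the total size is unknown or zero. */
    static double percent(long bytesRead, long totalBytes) {
        return totalBytes <= 0 ? 0.0 : 100.0 * bytesRead / totalBytes;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<root><item/><item/></root>".getBytes(StandardCharsets.UTF_8);
        CountingInputStream counter = new CountingInputStream(new ByteArrayInputStream(xml));
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(counter);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.printf("%s: %.1f%%%n",
                        reader.getLocalName(), percent(counter.getCount(), xml.length));
            }
        }
        reader.close();
    }
}
```

Counting bytes avoids the mismatch between character offsets and the file size on disk that arises with multi-byte encodings such as UTF-8, which is why the percentage can be computed directly against the file length.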