Java: gathering byte offsets of XML tags using an XMLStreamReader

Is there a way to accurately gather the byte offsets of XML tags using an XMLStreamReader?

I have a large XML file that I need random access to. Rather than writing the whole thing to a database, I would like to run through it once with an XMLStreamReader to gather the byte offsets of significant tags, and then use a RandomAccessFile to retrieve the tag content later.

XMLStreamReader doesn't seem to have a way to track character offsets. Instead, people recommend attaching the XMLStreamReader to a stream that tracks how many bytes have been read (the CountingInputStream provided by apache.commons.io, for example).

For example:

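import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.input.CountingInputStream;

// Wrap the file stream so the number of bytes pulled from it can be queried.
XMLInputFactory xmlStreamFactory = XMLInputFactory.newInstance();
CountingInputStream countingReader = new CountingInputStream(new FileInputStream(xmlFile));
XMLStreamReader xmlStreamReader = xmlStreamFactory.createXMLStreamReader(countingReader, "UTF-8");

while (xmlStreamReader.hasNext()) {
    int eventCode = xmlStreamReader.next();

    switch (eventCode) {
        case XMLStreamReader.END_ELEMENT:
            // Prints how many bytes the parser has consumed so far, not the tag position.
            System.out.println(xmlStreamReader.getLocalName() + " @" + countingReader.getByteCount());
            break;
    }
}
xmlStreamReader.close();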

Unfortunately there must be some buffering going on, because the above code prints the same byte offset for several tags. Is there a more accurate way of tracking byte offsets in XML files (ideally without abandoning proper XML parsing)?

You could use getLocation() on the XMLStreamReader (or XMLEvent.getLocation() if you use XMLEventReader), but I remember reading somewhere that it is neither reliable nor precise. It also looks like it gives the end point of the tag, not the starting location.

I have a similar need to know precisely where tags are located within a file, and I'm looking at other parsers to see if there is one that guarantees the necessary level of location precision.

Unfortunately, Aalto doesn't implement the LocationInfo interface.

The latest Java VTD-XML (ximpleware) implementation, currently 2.11 on SourceForge or on GitHub, provides some code that maintains a byte offset after each call to the getChar() method of its IReader implementations.

IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.

IReader implementations are provided for the following encodings:

ASCII
ISO_8859_1
ISO_8859_10
ISO_8859_11
ISO_8859_12
ISO_8859_13
ISO_8859_14
ISO_8859_15
ISO_8859_16
ISO_8859_2
ISO_8859_3
ISO_8859_4
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
ISO_8859_9
UTF_16BE
UTF_16LE
UTF8
WIN_1250
WIN_1251
WIN_1252
WIN_1253
WIN_1254
WIN_1255
WIN_1256
WIN_1257
WIN_1258

You could add a getCharOffset() method to IReader and implement it by adding a charCount member alongside the offset member of the VTDGen and VTDGenHuge classes, incrementing it on each getChar() and skipChar() call of each IReader implementation. That should give you the start of a solution.

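Could you use a wrapper input stream around the actual input stream, one that simply defers the actual I/O operations to the wrapped stream but keeps an internal counting mechanism, with some code to retrieve the current offset?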
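A minimal sketch of such a wrapper, in case it helps. The class name OffsetCountingInputStream and its getByteOffset() method are made up for illustration; this is essentially what CountingInputStream from the question already does. Note that it counts bytes as the parser pulls them, so if the parser reads ahead in large buffers (as the question suggests), the count will run ahead of the tag being reported.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper: delegates all I/O and counts the bytes handed to the caller. */
class OffsetCountingInputStream extends FilterInputStream {
    private long count; // bytes delivered (or skipped) so far

    OffsetCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++; // one byte consumed; -1 means end of stream
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n; // bulk reads advance the offset by the bytes actually read
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would make the count ambiguous, so disable it
    }

    /** Offset of the next byte that would be read from the underlying stream. */
    long getByteOffset() {
        return count;
    }
}
```

You would pass this to createXMLStreamReader in place of the FileInputStream and call getByteOffset() inside the event loop. On its own it only tells you how far the parser has read, not where the current tag sits.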

I think I've found another option. If you replace your switch block with the following, it will dump the position immediately after the end element tag.

        switch (eventCode) {
        case XMLStreamReader.END_ELEMENT :
            System.out.println(xmlStreamReader.getLocalName() + " end@" + xmlStreamReader.getLocation().getCharacterOffset()) ;
        }

This solution would also require manually calculating the actual start position of the end tags, but it has the advantage of not needing an external JAR file.
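The manual calculation could look roughly like this. It is only a sketch: the EndTagOffsets class and endTagStart method are hypothetical names, and it assumes the end tag is written exactly as "</" + qualified name + ">" with no whitespace before the ">". It will be wrong for self-closing elements such as <item/>, which have no separate end tag. The result is in the same unit as the offset you pass in (characters, if it came from getCharacterOffset()).

```java
/** Hypothetical helper for turning an "offset just after the end tag" into its start. */
class EndTagOffsets {

    /**
     * Estimates where an end tag begins, given the offset immediately after it.
     * Assumes the tag text is exactly "</" + qualifiedName + ">".
     *
     * @param offsetAfterEndTag offset reported right after the END_ELEMENT event
     * @param prefix            namespace prefix as written in the document, or null/empty
     * @param localName         local name of the element
     */
    static long endTagStart(long offsetAfterEndTag, String prefix, String localName) {
        String qualifiedName = (prefix == null || prefix.isEmpty())
                ? localName
                : prefix + ":" + localName;
        // "</" contributes 2 characters and ">" contributes 1.
        return offsetAfterEndTag - (qualifiedName.length() + 3);
    }
}
```

Inside the END_ELEMENT case you would call it with xmlStreamReader.getLocation().getCharacterOffset(), xmlStreamReader.getPrefix() and xmlStreamReader.getLocalName().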

I was not able to track down some minor inconsistencies in the data management (I think it has to do with how I initialized my XMLStreamReader), but I always saw a consistent increase in the location as the reader moved through the content.

Hope this helps!

I recently worked out a solution for a similar question, "How to find character offsets in big XML files using java?". I think it provides a good solution based on an ANTLR-generated XML parser.

I just burned a long weekend on this, and arrived at the solution partially thanks to some clues here. Remarkably, I don't think this has gotten much easier in the 10 years since the OP posted this question.

TL;DR: Use Woodstox and char offsets

The first problem to contend with is that most XMLStreamReader implementations seem to provide inaccurate results when you ask them for their current offsets. Woodstox, however, seems to be rock-solid in this regard.

The second problem is the actual type of offset you use. Unfortunately it seems that you have to use char offsets if you need to work with a multi-byte charset, which means the random-access retrieval from the file is not going to be very efficient: you can't just set a pointer into the file at your offset and start reading, you have to read through until you get to the offset, then start extracting. There may be a more efficient way to do this that I haven't thought of, but the performance is acceptable for my case. 500MB files are pretty snappy.
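To illustrate the "read through until you get to the offset" part, here is a rough sketch, not my actual code. CharOffsetExtractor and extract are hypothetical names, the input is assumed to be UTF-8, and the offsets are assumed to count Java chars (UTF-16 code units), which may or may not match how your parser counts characters outside the BMP.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: pulls a char range out of a UTF-8 stream by reading up to it. */
class CharOffsetExtractor {

    /** Returns the chars in [startChar, endChar), or fewer if the input ends early. */
    static String extract(InputStream in, long startChar, long endChar) {
        try (Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // There is no way to seek by char offset, so decode and discard up to the start.
            long toSkip = startChar;
            while (toSkip > 0) {
                long skipped = reader.skip(toSkip);
                if (skipped <= 0) {
                    throw new IllegalArgumentException("start offset is past the end of input");
                }
                toSkip -= skipped;
            }
            // Read exactly the requested number of chars (read() may return short counts).
            int wanted = Math.toIntExact(endChar - startChar);
            char[] buf = new char[wanted];
            int filled = 0;
            while (filled < wanted) {
                int n = reader.read(buf, filled, wanted - filled);
                if (n < 0) {
                    break; // input ended before endChar
                }
                filled += n;
            }
            return new String(buf, 0, filled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost is linear in the start offset for every lookup, which is why this is less efficient than seeking by byte offset.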

[edit] So this turned into one of those splinter-in-my-mind things, and I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for that char offset. We can get the byte offsets of the beginning and end of the element, which gives us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile.
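For anyone who wants the general shape of the idea without reading the library, here is a stripped-down sketch. It is not the library's code: Utf8OffsetTrackingReader, byteOffsetOf and extract are names invented for this example, it only handles well-formed UTF-8, and the window size is an assumption that has to be at least as large as the parser's read-ahead.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical reader that remembers the UTF-8 byte offset of recently read chars. */
class Utf8OffsetTrackingReader extends FilterReader {
    private final long[] window; // window[i % size] = byte offset at which char i starts
    private long charCount;      // chars read so far
    private long byteCount;      // UTF-8 bytes those chars occupy in the file

    Utf8OffsetTrackingReader(Reader in, int windowSize) {
        super(in);
        this.window = new long[windowSize];
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c >= 0) {
            track((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            track(buf[off + i]);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long done = 0;
        while (done < n && read() >= 0) {
            done++; // skip by reading so skipped chars are still tracked
        }
        return done;
    }

    @Override
    public boolean markSupported() {
        return false; // reset() would rewind the stream behind the counters
    }

    private void track(char c) {
        window[(int) (charCount % window.length)] = byteCount;
        charCount++;
        if (c < 0x80) {
            byteCount += 1;
        } else if (c < 0x800) {
            byteCount += 2;
        } else if (Character.isSurrogate(c)) {
            byteCount += 2; // each half of a surrogate pair; the pair is 4 bytes in UTF-8
        } else {
            byteCount += 3;
        }
    }

    /** Byte offset at which the given char offset starts; must still be in the window. */
    long byteOffsetOf(long charOffset) {
        if (charOffset == charCount) {
            return byteCount; // position just after the last char read
        }
        if (charOffset < 0 || charOffset > charCount
                || charCount - charOffset > window.length) {
            throw new IllegalArgumentException("char offset not in window: " + charOffset);
        }
        return window[(int) (charOffset % window.length)];
    }

    /** Seeks straight to a byte range and decodes it as UTF-8. */
    static String extract(Path file, long startByte, long endByte) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[Math.toIntExact(endByte - startByte)];
            raf.seek(startByte);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

The parser would be handed this reader (wrapped around an InputStreamReader on the file), and for each element of interest you would translate its start and end char offsets with byteOffsetOf while they are still in the window, then call extract later with the stored byte offsets.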

I created a library for this; it's on GitHub and Maven Central. If you just want the important bits, the party trick is in the ByteTrackingReader. [/edit]

There is another similar question on SO about this (but the accepted answer frightened and confused me), and some people commented that this whole thing is a bad idea: why would you want to do it? XML is a transport mechanism, so you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML (still going strong in 2020), you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents; the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.

Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution. God help you if you're finding this in 2030, trying to solve the same problem.
