Java: reading XML while preserving all entities

Question

I want to read XHTML files using SAX or StAX, whichever works best. But I don't want entities to be resolved, replaced or anything like that. Ideally they should just remain as they are. I don't want to use DTDs.

Here's an (executable, using Scala 2.8.x) example:

import javax.xml.stream._
import javax.xml.stream.events._
import java.io._

println("StAX Test - "+args(0)+"\n")
val factory = XMLInputFactory.newInstance
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false)

println("------")
val xer = factory.createXMLEventReader(new FileReader(args(0)))
val entities = new collection.mutable.ArrayBuffer[String]
while (xer.hasNext) {
    val event = xer.nextEvent
    if (event.isCharacters) {
        print(event.asCharacters.getData)
    } else if (event.getEventType == XMLStreamConstants.ENTITY_REFERENCE) {
        entities += event.asInstanceOf[EntityReference].getName
    }
}
println("------")
println("Entities: " + entities.mkString(", "))

Given the following XHTML file ...

<html>
    <head>
        <title>StAX Test</title>
    </head>
    <body>
        <h1>Hallo StAX</h1>
        <p id="html">
            &lt;div class=&quot;header&quot;&gt;
        </p>
        <p id="stuff">
            &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
        </p>
        Das war's!
    </body>
</html>

... running scala stax-test.scala stax-test.xhtml will result in:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      <div class="header">


      berdies sollte das hier auch als Copyright sichtbar sein: ?

    Das war's!

------
Entities: Uuml

So all entities have been replaced more or less successfully. What I would have expected and what I want is this, though:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      &lt;div class=&quot;header&quot;&gt;


      &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;

    Das war's!

------
Entities: // well, or no entities above and instead:
// Entities: lt, quot, quot, gt, Uuml, #169

Is this even possible? I want to parse XHTML, do some modifications and then output it as XHTML again. So I really want the entities to remain in the result.

Also I don't get why Uuml is reported as an EntityReference event while the rest aren't.

Answer 1

A bit of terminology: &#x169; is a numeric character reference (not an entity), and &auml; is an entity reference (not an entity).

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.

As for entity references, low-level parse interfaces such as SAX will report the existence of the entity reference - at any rate, it reports them when they occur in element content, but not in attribute content. These are special events that are notified only to the LexicalHandler rather than to the ContentHandler.
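To make the LexicalHandler point concrete, here is a minimal Java sketch. It rests on assumptions that are not part of the answer: the class name EntityReferenceSketch and the helper entityNames are made up for illustration, and the sample document declares Uuml in an internal DTD subset, because a parser can only report a reference to an entity it knows about. That declaration is exactly the kind of DTD use the question would like to avoid, so treat this as a demonstration of the mechanism rather than a solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

public class EntityReferenceSketch {

    /** Collects the names passed to LexicalHandler.startEntity while parsing xml. */
    static List<String> entityNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                // Skip the "[dtd]" pseudo-entity and parameter entities ("%name").
                if (!name.startsWith("[") && !name.startsWith("%")) {
                    names.add(name);
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Standard SAX2 property id for registering a LexicalHandler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input; Uuml is declared in the internal subset.
        String xml = "<!DOCTYPE p [<!ENTITY Uuml \"&#220;\">]>"
                   + "<p>&lt;b&gt; &Uuml;berdies &#169;</p>";
        System.out.println("Entities: " + entityNames(xml));
    }
}
```

The handler only learns the entity's name; the replacement text still arrives through the ordinary characters callback, and the predefined references (lt, gt) and the numeric reference are delivered as plain characters, which matches the behaviour described above.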

Answer 2

The answer to "why Uuml is reported as an EntityReference event while the rest aren't" is that the rest are defined by the XML spec, while &Uuml; is specific to HTML 4.0.

Since your goal is to write modified XHTML, it may be possible to force the serializer to emit numeric entity references by setting the "encoding" to "US-ASCII" and/or the "method" to "html". The XSLT spec (which underlies Java XML serializers) says that the serializer "may output a character using a character entity reference" when the method is html. Setting the encoding to ASCII may force it to use numeric entities if named entities aren't supported.
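A minimal Java sketch of the encoding half of this suggestion, using the JAXP identity transformer. The class name AsciiSerializerSketch and the helper toAscii are invented for illustration, and the sketch only covers the "US-ASCII" setting with the xml output method; what the html method emits is left to the serializer and is not shown here.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AsciiSerializerSketch {

    /** Re-serializes xml with an identity transform whose output encoding is US-ASCII. */
    static String toAscii(String xml) throws Exception {
        // newTransformer() without a stylesheet yields an identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        identity.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        // Characters outside US-ASCII must have been written as character references.
        return out.toString("US-ASCII");
    }

    public static void main(String[] args) throws Exception {
        // \u00DC and \u00A9 stand for the characters the parser produces
        // after expanding &Uuml; and &#169;.
        System.out.println(toAscii("<p>\u00DCberdies: \u00A9</p>"));
    }
}
```

This does not restore the references exactly as they appeared in the input; it only guarantees that non-ASCII characters leave the serializer as character references instead of raw characters.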

Answer 3

In Java I would use a regular expression.

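import java.io.*;
import java.util.*;
import java.util.regex.*;

public class EntityScan {
  public static void main(String... args) throws IOException {
    // Matches anything between '&' and the next ';', e.g. lt, Uuml, #169.
    Pattern entity = Pattern.compile("&([^;]+);");
    // LinkedHashSet keeps first-seen order and drops duplicates.
    Set<String> entities = new LinkedHashSet<String>();
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader buf = new BufferedReader(new FileReader(args[0]))) {
      for (String line; (line = buf.readLine()) != null; ) {
        Matcher m = entity.matcher(line);
        while (m.find())
          entities.add(m.group(1));
      }
    }
    System.out.println("Entities: " + entities);
  }
}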

prints

Entities: [lt, quot, gt, Uuml, #169]

Answer 1 (score 2, 2011-09-12 12:26:59)

Answer 2 (score 1, 2011-09-12 11:59:49)

Answer 3 (score -2, 2011-09-12 09:52:29)
