Java - Read XML and leave all entities alone

I want to read XHTML files using SAX or StAX, whichever works best. But I don't want entities to be resolved, replaced or anything like that. Ideally they should just remain as they are. I don't want to use DTDs.

Here's an executable example (using Scala 2.8.x):

import javax.xml.stream._
import javax.xml.stream.events._
import java.io._

println("StAX Test - "+args(0)+"\n")
val factory = XMLInputFactory.newInstance
// Ignore DTDs and ask the parser not to replace entity references
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false)

println("------")
val xer = factory.createXMLEventReader(new FileReader(args(0)))
// Names of all entity references the parser reports as separate events
val entities = new collection.mutable.ArrayBuffer[String]
while (xer.hasNext) {
    val event = xer.nextEvent
    if (event.isCharacters) {
        // Text content, printed exactly as the parser delivers it
        print(event.asCharacters.getData)
    } else if (event.getEventType == XMLStreamConstants.ENTITY_REFERENCE) {
        entities += event.asInstanceOf[EntityReference].getName
    }
}
println("------")
println("Entities: " + entities.mkString(", "))

Given the following XHTML file ...

<html>
    <head>
        <title>StAX Test</title>
    </head>
    <body>
        <h1>Hallo StAX</h1>
        <p id="html">
            &lt;div class=&quot;header&quot;&gt;
        </p>
        <p id="stuff">
            &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
        </p>
        Das war's!
    </body>
</html>

... running scala stax-test.scala stax-test.xhtml will result in:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      <div class="header">


      berdies sollte das hier auch als Copyright sichtbar sein: ?

    Das war's!

------
Entities: Uuml

So all entities have been replaced more or less successfully. What I expected, and what I want, is this:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      &lt;div class=&quot;header&quot;&gt;


      &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;

    Das war's!

------
Entities: // well, or no entities above and instead:
// Entities: lt, quot, quot, gt, Uuml, #169

Is this even possible? I want to parse XHTML, make some modifications and then output it as XHTML again. So I really want the entities to remain in the result.

Also, I don't get why Uuml is reported as an EntityReference event while the rest aren't.

A bit of terminology: &#x169; is a numeric character reference (not an entity), and &auml; is an entity reference (not an entity).

I don't think any XML parser will report numeric character references to the application; they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
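To see this expansion in practice, here is a minimal sketch using the JDK's built-in StAX implementation with the same two factory properties as in the question. The class name CharRefDemo and the helper textOf are made up for illustration, and other StAX implementations may differ in details.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: collects all character data the parser delivers.
class CharRefDemo {
    static String textOf(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Same settings as in the question: no DTD, no entity replacement.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            // Text may arrive in several CHARACTERS events, so append each one.
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                text.append(reader.getText());
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The numeric character reference and the predefined entities
        // reach the application as plain characters.
        System.out.println(textOf("<p>&lt;b&gt; &#169;</p>"));
    }
}
```

Even with IS_REPLACING_ENTITY_REFERENCES switched off, the application only ever sees the expanded characters here, which matches the output shown in the question.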

As for entity references, low-level parse interfaces such as SAX will report the existence of the entity reference; at any rate, they report them when they occur in element content, but not in attribute content. There are special events notified only to the LexicalHandler rather than to the ContentHandler.
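A rough sketch of how the LexicalHandler route could look with the JDK's SAX parser follows. The class name LexicalHandlerDemo and the helper reportedEntities are invented for this example. Note one assumption that departs from the question: the sample input declares Uuml in an internal DTD subset, because a SAX parser is allowed to reject a reference to an entity it has never seen declared.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: lists entity references reported to a LexicalHandler.
class LexicalHandlerDemo {
    static List<String> reportedEntities(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                // Skip the "[dtd]" pseudo-entity and parameter entities ("%name").
                if (!name.startsWith("[") && !name.startsWith("%")) {
                    names.add(name);
                }
            }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // The lexical handler is registered through a standard SAX2 property.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumption for the demo: Uuml is declared in an internal subset.
        String xml = "<!DOCTYPE p [<!ENTITY Uuml \"&#220;\">]>"
                   + "<p>&Uuml;ber &lt;b&gt; &#169;</p>";
        System.out.println(reportedEntities(xml));
    }
}
```

The handler still receives the replacement text through characters(); startEntity and endEntity only mark where the reference began and ended, so an application that wants to write the reference back out has to track those boundaries itself.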

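The answer to "why Uuml is reported as an EntityReference event while the rest aren't" is that the rest are defined by the XML spec, while &Uuml; is specific to HTML 4.0. &lt;, &gt; and &quot; are among the five entities predefined by XML (together with &amp; and &apos;), and &#169; is a numeric character reference, so the parser can expand all of them without a DTD. &Uuml; is declared only in the HTML/XHTML DTDs, so a parser that does not read a DTD has nothing to expand it to.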

Since your goal is to write modified XHTML, it may be possible to force the serializer to emit numeric character references by setting the "encoding" to "US-ASCII" and/or the "method" to "html". The XSLT spec (which underlies Java XML serializers) says that the serializer "may output a character using a character entity reference" when the method is html. Setting the encoding to ASCII may force it to use numeric character references if named entities aren't supported.

In Java I would use a regular expression.
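As a hedged illustration of the encoding suggestion, the sketch below serializes a small DOM tree with the JDK's default Transformer and the output encoding set to US-ASCII. The class name AsciiSerializerDemo and the helper serializeAsAscii are invented for this example. It uses the xml output method; whether the html method produces named references such as &copy; depends on the serializer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: writes non-ASCII text through an ASCII-only serializer.
class AsciiSerializerDemo {
    static String serializeAsAscii(String text) throws Exception {
        // Build a one-element document <p>text</p> in memory.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element p = doc.createElement("p");
        p.setTextContent(text);
        doc.appendChild(p);

        // An identity transformer acts as a plain serializer.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");

        // Write to a byte stream so the chosen encoding is actually applied.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        // U+00DC is "Ü" and U+00A9 is the copyright sign.
        System.out.println(serializeAsAscii("\u00DCberdies: \u00A9"));
    }
}
```

Characters that US-ASCII cannot represent come out as numeric character references such as &#220; and &#169;. This does not restore the original &Uuml; spelling, but the resulting document is equivalent for any XML or HTML consumer.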

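import java.io.*;
import java.util.*;
import java.util.regex.*;

// The class name is arbitrary; any name works.
public class EntityScan {
  public static void main(String... args) throws IOException {
    // Loose match: anything between '&' and the next ';' counts as a reference.
    Pattern entity = Pattern.compile("&([^;]+);");
    // LinkedHashSet drops duplicates but keeps first-seen order.
    Set<String> entities = new LinkedHashSet<String>();
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader buf = new BufferedReader(new FileReader(args[0]))) {
      for (String line; (line = buf.readLine()) != null; ) {
        Matcher m = entity.matcher(line);
        while (m.find())
          entities.add(m.group(1));
      }
    }
    System.out.println("Entities: " + entities);
  }
}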

prints

Entities: [lt, quot, gt, Uuml, #169]
