Leave entities as-is when parsing XML with Woodstox

I'm using Woodstox to process an XML document that contains some entities (most notably &gt;) in the value of one of the nodes. To use an extreme example, it looks something like this:

<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>

I have tried many different configuration options on both WstxInputFactory (IS_REPLACING_ENTITY_REFERENCES, P_TREAT_CHAR_REFS_AS_ENTS, P_CUSTOM_INTERNAL_ENTITIES, ...) and WstxOutputFactory, but no matter what I try, the output is always something like this:

<parent>nbsp; &lt; nbsp; > &amp; " ' nbsp;</parent>

(&gt; gets converted to >, &lt; stays the same, &nbsp; loses the &, ...)

I'm reading the XML with an XMLEventReader created with

XMLEventReader reader = wstxInputFactory.createXMLEventReader(new StringReader(fulltext));

after configuring the WstxInputFactory.

Is there any way to configure Woodstox to ignore all entities and output the text exactly as it was in the input String?

The five predefined XML entities (quot, amp, apos, lt, gt) are always processed. As far as I know, there is no way to recover their original source form with SAX.
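A minimal sketch of this behavior, assuming the StAX implementation bundled with the JDK rather than Woodstox (the class name PredefinedEntitiesDemo and the helper textOf are made up for illustration). Even with IS_REPLACING_ENTITY_REFERENCES set to false, the character data reported by the parser should already contain the expanded characters:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class; uses whichever StAX implementation the JDK provides.
class PredefinedEntitiesDemo {

    // Concatenates all character data the parser reports for the document.
    static String textOf(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser not to replace entity references; this is not
            // expected to affect the five predefined entities.
            factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                }
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only predefined entities here, so no DTD is needed.
        System.out.println(textOf("<parent>&lt; &gt; &amp; &quot; &apos;</parent>"));
    }
}
```

The text arrives as plain characters, so by the time application code sees it there is no way to tell whether the source contained &lt; or a character reference for the same character.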

You can process the other entities manually: capture the events up to the end of the element and concatenate their values:

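    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.events.EntityReference;
    import javax.xml.stream.events.XMLEvent;
    import com.ctc.wstx.stax.WstxInputFactory;

    // Construct the Woodstox factory directly so the implementation
    // does not depend on provider lookup.
    XMLInputFactory factory = new WstxInputFactory();
    // Report entity references as events instead of expanding them.
    factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
    XMLEventReader xmlr = factory.createXMLEventReader(
            this.getClass().getResourceAsStream(xmlFileName));

    StringBuilder value = new StringBuilder();
    while (xmlr.hasNext()) {
        XMLEvent event = xmlr.nextEvent();
        if (event.isCharacters()) {
            // Plain text; predefined entities arrive here already expanded.
            value.append(event.asCharacters().getData());
        } else if (event.isEntityReference()) {
            // Rebuild the reference as it appeared in the source.
            value.append('&').append(((EntityReference) event).getName()).append(';');
        } else if (event.isEndElement()) {
            // Assign it to the right variable
            System.out.println(value);
            value.setLength(0);
        }
    }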

For your example input:

<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>

The output will be:

&nbsp; < &nbsp; > & " ' &nbsp;

Alternatively, if you want to convert all the entities, you could use a custom XMLResolver for undeclared entities:

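import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;

public class NaiveHtmlEntityResolver implements XMLResolver {

    // Replacement text keyed by entity name (without the & and ;).
    private static final Map<String, String> ENTITIES = new HashMap<>();

    static {
        ENTITIES.put("nbsp", "\u00A0"); // non-breaking space
        ENTITIES.put("apos", "'");
        ENTITIES.put("quot", "\"");
        // and so on
    }

    @Override
    public Object resolveEntity(String publicID,
            String systemID,
            String baseURI,
            String namespace) throws XMLStreamException {
        // An undeclared entity has no IDs; this code relies on the
        // entity name arriving in the fourth argument.
        if (publicID == null && systemID == null) {
            return ENTITIES.get(namespace);
        }
        return null;
    }
}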

Then tell Woodstox to use it for undeclared entities:

    factory.setProperty(WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER, new NaiveHtmlEntityResolver());
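To check the lookup logic without a parser, the resolver can be called directly. The sketch below uses a simplified stand-in for the resolver (the class name ResolverLookupDemo and the helper resolve are made up for illustration) and assumes the calling convention the resolver code relies on: null public and system IDs, with the entity name in the fourth argument.

```java
import java.util.Map;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;

// Hypothetical demo class; it only exercises the lookup, not Woodstox itself.
class ResolverLookupDemo {

    // Simplified stand-in for the entity table of the resolver.
    static final Map<String, String> ENTITIES = Map.of("nbsp", "\u00A0");

    // Same decision as the resolver: answer only when both IDs are null.
    static final XMLResolver RESOLVER = (publicID, systemID, baseURI, name) ->
            (publicID == null && systemID == null) ? ENTITIES.get(name) : null;

    // Calls the resolver the way an undeclared entity is assumed to be reported.
    static Object resolve(String entityName) {
        try {
            return RESOLVER.resolveEntity(null, null, null, entityName);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("\u00A0".equals(resolve("nbsp"))); // known entity: prints true
        System.out.println(resolve("copy"));                  // unknown entity: prints null
    }
}
```

Returning null for unknown names leaves the decision to the parser, which would then typically report the entity as undeclared.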

First of all, you need to include the actual code: "the output is always something like this" means little without explaining exactly how you output the parsed content. You may be printing events, using some library, or perhaps using a Woodstox stream or event writer.

Second, XML distinguishes between the small number of predefined entities (lt, gt, apos, quot, amp) and arbitrary user-defined entities, which is what nbsp would be here. The former you can use as-is because they are already defined; the latter exist only if you define them in a DTD.

The two groups are also handled differently. The former are always expanded no matter what, as required by the XML specification. The latter are resolved (unless resolution is disabled) and then expanded; if they are not defined, an exception is thrown. You can also specify a custom resolver, as mentioned in the other answer, but it will only be used for custom entities (here, &nbsp;).
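As a sketch of the user-defined case, the document below declares nbsp in an internal DTD subset, after which the parser expands it like any other declared entity. This assumes the StAX implementation bundled with the JDK rather than Woodstox; the class name DeclaredEntityDemo and the helper textOf are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class; uses whichever StAX implementation the JDK provides.
class DeclaredEntityDemo {

    // Concatenates all character data the parser reports for the document.
    static String textOf(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Make the relevant settings explicit: read the DTD and
            // expand declared entity references into characters.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
            factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    text.append(event.asCharacters().getData());
                }
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The internal subset defines nbsp as the character U+00A0.
        String xml = "<!DOCTYPE parent [<!ENTITY nbsp \"&#160;\">]>"
                + "<parent>a&nbsp;b</parent>";
        String text = textOf(xml);
        System.out.println((int) text.charAt(1)); // code point of the expanded entity: 160
    }
}
```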

Finally, it helps to explain what you are trying to achieve rather than only what you are doing. That makes it easier to suggest a good approach than a narrow "how do I do X" question, where X may not be the right way to go about it.

As for configuring Woodstox, maybe this blog entry:

https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

will help (as well as the two others in the series); it covers the existing configuration settings.
