Leave entities as-is when parsing XML with Woodstox

I'm using Woodstox to process an XML document that contains some entities (most notably &gt;) in the value of one of the nodes. To use an extreme example, it's something like this:

<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>

I have tried a lot of different configuration options for both WstxInputFactory (IS_REPLACING_ENTITY_REFERENCES, P_TREAT_CHAR_REFS_AS_ENTS, P_CUSTOM_INTERNAL_ENTITIES...) and WstxOutputFactory, but no matter what I try, the output is always something like this:

<parent>nbsp; &lt; nbsp; > &amp; " ' nbsp;</parent>

(&gt; gets converted to >, &lt; stays the same, &nbsp; loses the &...)

I'm reading the XML with an XMLEventReader created with

XMLEventReader reader = wstxInputFactory.createXMLEventReader(new StringReader(fulltext));

after configuring the WstxInputFactory.

Is there any way to configure Woodstox to just ignore all entities and output the text exactly as it was in the input String?

The basic five XML entities (quot, amp, apos, lt, gt) will always be processed. As far as I know, there is no way to get their original source text with StAX.

For the other entities you can process them manually. You can capture the events until the end of the element and concatenate the values:

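    // Needs javax.xml.stream.*, javax.xml.stream.events.* and com.ctc.wstx.stax.WstxInputFactory
    XMLInputFactory factory = new WstxInputFactory();
    // Report entity references as events instead of expanding them
    factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
    XMLEventReader xmlr = factory.createXMLEventReader(
            this.getClass().getResourceAsStream(xmlFileName));

    StringBuilder value = new StringBuilder();
    while (xmlr.hasNext()) {
        XMLEvent event = xmlr.nextEvent();
        if (event.isCharacters()) {
            // Predefined entities (&lt; etc.) arrive here already expanded
            value.append(event.asCharacters().getData());
        } else if (event.isEntityReference()) {
            // Rebuild the reference text for other entities such as &nbsp;
            value.append('&').append(((EntityReference) event).getName()).append(';');
        } else if (event.isEndElement()) {
            // Assign it to the right variable
            System.out.println(value);
            value.setLength(0);
        }
    }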

For your example input:

<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>

The output will be:

&nbsp; < &nbsp; > & " ' &nbsp;
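
If the goal is text that looks like the input again, one option is to re-escape the character data before appending it, since the five predefined entities arrive already expanded. The following is only a minimal sketch: `EntityReescaper` and `escapePredefined` are hypothetical names, not part of Woodstox or StAX. It also cannot know whether the input contained a literal `>` or `&gt;`, so it produces a normalized form rather than a byte-for-byte copy of the original.

```java
public class EntityReescaper {

    /** Re-escapes the five predefined XML entities in already-parsed character data. */
    public static String escapePredefined(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // " < " stands for character data delivered between two entity-reference events
        System.out.println("&nbsp;" + escapePredefined(" < ") + "&nbsp;");
        // prints: &nbsp; &lt; &nbsp;
    }
}
```

In the loop above this would mean appending `escapePredefined(event.asCharacters().getData())` instead of the raw data, while leaving the entity-reference branch untouched so that the rebuilt `&name;` text is not escaped a second time.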

Alternatively, if you want to convert all the entities, you could use a custom XMLResolver for the undeclared ones:

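import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;

public class NaiveHtmlEntityResolver implements XMLResolver {

    private static final Map<String, String> ENTITIES = new HashMap<>();

    static {
        ENTITIES.put("nbsp", "\u00A0"); // non-breaking space
        ENTITIES.put("apos", "'");
        ENTITIES.put("quot", "\"");
        // and so on
    }

    @Override
    public Object resolveEntity(String publicID,
            String systemID,
            String baseURI,
            String namespace) throws XMLStreamException {
        // For undeclared entities, the entity name arrives in the "namespace" argument
        if (publicID == null && systemID == null) {
            return ENTITIES.get(namespace);
        }
        return null;
    }
}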

And then tell Woodstox to use it for the undeclared entities:

    factory.setProperty(WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER, new NaiveHtmlEntityResolver());

First of all, you need to include actual code, since "output is always something like this" makes no sense without explaining exactly how you are outputting the parsed content: you may be printing events, using some library, or perhaps using a Woodstox stream or event writer.

Second: in XML there is a difference between the small number of predefined entities (lt, gt, apos, quot, amp) and arbitrary user-defined entities, which is what nbsp would be here. The former can be used as-is because they are already defined; the latter exist only if you define them in a DTD.

Handling of the two groups is different, too: the former will always be expanded no matter what, as required by the XML specification. The latter will be resolved (unless resolution is disabled) and then expanded, or, if they are not defined, an exception will be thrown. You can also specify a custom resolver, as mentioned in the other answer, but it will only be used for custom entities (here, &nbsp;).
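
To illustrate the DTD point, here is a small sketch that declares nbsp in an internal DTD subset and parses the document with the StAX implementation that `XMLInputFactory.newInstance()` happens to return (the JDK's built-in one unless Woodstox is on the classpath). The class name `DeclaredEntityDemo` and the helper `textOf` are made up for this example, and it assumes default factory settings, where DTD support and entity replacement are enabled.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DeclaredEntityDemo {

    /** Returns the concatenated character data of the document. */
    public static String textOf(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent text segments so expanded entities and literal text arrive together
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // nbsp is declared in the internal subset; lt and gt need no declaration
        String xml = "<!DOCTYPE parent [<!ENTITY nbsp \"&#160;\">]>"
                + "<parent>&nbsp;&lt;&gt;</parent>";
        // Make the non-breaking space visible when printing
        System.out.println(textOf(xml).replace("\u00A0", "[U+00A0]"));
    }
}
```

Once the entity is declared, it is expanded like any other, which also means the original `&nbsp;` spelling is no longer visible in the character data.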

Finally, it helps to explain what you are trying to achieve rather than only what you are doing. That makes it easier to suggest a good approach than a narrow "how do I do X" question, where X may not be the right way to go about it.

And as to configuration of Woodstox, maybe this blog entry:

https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

will help (as well as 2 others in the series) -- it covers existing configuration settings.
