SAX parser: Ignoring special characters

I'm using Xerces to parse my XML document. The issue is that XML escaped characters (entity references) appear in the characters() method as non-escaped ones. I need to get the escaped characters inside the characters() method as is.

Thanks.

UPD: I tried to override the resolveEntity() method in my DefaultHandler's descendant. I can see from debugging that it is set as the entity resolver on the XML reader, but the code in the overridden method is not invoked.

I think your solution is not too bad: a few lines of code to do exactly what you want. The problem is that the startEntity and endEntity methods are not provided by the ContentHandler interface, so you have to write a LexicalHandler which works in combination with your ContentHandler. Usually, using an XMLFilter is more elegant, but since you have to work with entities, you still need to write a LexicalHandler. Take a look here for an introduction to the use of SAX filters.

I'd like to show you a way, very similar to yours, which lets you separate filtering operations (wrapping & as &amp;, for instance) from output operations (or anything else). I've written my own XMLFilter based on XMLFilterImpl, which also implements the LexicalHandler interface. This filter contains only the code related to entity escaping/unescaping.

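import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class XMLFilterEntityImpl extends XMLFilterImpl implements
        LexicalHandler {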

    private String currentEntity = null;

    public XMLFilterEntityImpl(XMLReader reader)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        super(reader);
        setProperty("http://xml.org/sax/properties/lexical-handler", this);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (currentEntity == null) {
            super.characters(ch, start, length);
            return;
        }

        String entity = "&" + currentEntity + ";";
        super.characters(entity.toCharArray(), 0, entity.length());
        currentEntity = null;
    }

    @Override
    public void startEntity(String name) throws SAXException {
        currentEntity = name;
    }

    @Override
    public void endEntity(String name) throws SAXException {
    }

    @Override
    public void startDTD(String name, String publicId, String systemId)
            throws SAXException {
    }

    @Override
    public void endDTD() throws SAXException {
    }

    @Override
    public void startCDATA() throws SAXException {
    }

    @Override
    public void endCDATA() throws SAXException {
    }

    @Override
    public void comment(char[] ch, int start, int length) throws SAXException {
    }
}

And this is my main method, with a DefaultHandler as the ContentHandler, which receives the entity as is thanks to the filter code:

public static void main(String[] args) throws ParserConfigurationException,
        SAXException, IOException {

    DefaultHandler defaultHandler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            //This method receives the entity as is
            System.out.println(new String(ch, start, length));
        }
    };

    XMLFilter xmlFilter = new XMLFilterEntityImpl(XMLReaderFactory.createXMLReader());
    xmlFilter.setContentHandler(defaultHandler);
    String xml = "<html><head><title>title</title></head><body>&amp;</body></html>";
    xmlFilter.parse(new InputSource(new StringReader(xml)));
}

And this is my output:

title
&amp;

You probably won't like it, but it is an alternative solution anyway.
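One caveat with the filter above: SAX allows a parser to deliver contiguous character data in several characters() calls, so replacing only the first chunk after startEntity may let part of the expansion through. The following is a minimal sketch of a variant that writes the reference in startEntity and suppresses everything up to the matching endEntity. The class name EntityPreservingFilter, the parse helper and the declared entity foo are made up for illustration, and the example uses a DTD-declared entity so that it does not depend on how a given parser reports predefined entities.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical variant of the filter: emits "&name;" once and drops the expansion.
class EntityPreservingFilter extends XMLFilterImpl implements LexicalHandler {

    // Nesting depth of the general entities currently being expanded.
    private int entityDepth = 0;

    EntityPreservingFilter(XMLReader parent) throws SAXException {
        super(parent);
        // XMLFilterImpl forwards setProperty to the parent reader.
        setProperty("http://xml.org/sax/properties/lexical-handler", this);
    }

    // The external DTD subset is reported as "[dtd]", parameter entities as "%name".
    private static boolean isGeneralEntity(String name) {
        return !name.startsWith("[") && !name.startsWith("%");
    }

    @Override
    public void startEntity(String name) throws SAXException {
        if (!isGeneralEntity(name)) return;
        if (entityDepth++ == 0) {
            // Only the outermost reference is written; nested ones are part of its expansion.
            String ref = "&" + name + ";";
            super.characters(ref.toCharArray(), 0, ref.length());
        }
    }

    @Override
    public void endEntity(String name) {
        if (isGeneralEntity(name)) entityDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Drop the replacement text while inside an entity expansion.
        if (entityDepth == 0) super.characters(ch, start, length);
    }

    @Override public void startDTD(String name, String publicId, String systemId) {}
    @Override public void endDTD() {}
    @Override public void startCDATA() {}
    @Override public void endCDATA() {}
    @Override public void comment(char[] ch, int start, int length) {}

    // Helper for the demo: parses the XML and collects all character data.
    static String parse(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EntityPreservingFilter filter = new EntityPreservingFilter(reader);
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE r [<!ENTITY foo \"bar\">]><r>a &foo; b</r>";
        System.out.println(parse(xml));
    }
}
```

Tracking a depth instead of a single name keeps the filter correct when one entity's replacement text refers to another entity.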

I'm sorry, but with a SAX parser I don't think there is a more elegant way.

You should also consider switching to a StAX parser: it is very easy to do what you want with XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES set to false. If you like this solution, you should take a look here.
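As a rough sketch of the StAX approach, the code below turns off entity replacement and rebuilds the text with the references left in place. The class name StaxEntityDemo, the helper method and the declared entity foo are invented for the example. How predefined references such as &amp; are reported can vary between StAX implementations, so the sketch sticks to an entity declared in the DTD.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original answer.
class StaxEntityDemo {

    // Collects element text, writing entity references back as "&name;".
    static String textWithEntityRefs(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to report references instead of expanding them.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder out = new StringBuilder();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.ENTITY_REFERENCE) {
                    // For ENTITY_REFERENCE events getLocalName() returns the entity name.
                    out.append('&').append(reader.getLocalName()).append(';');
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    out.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE r [<!ENTITY foo \"bar\">]><r>a &foo; b</r>";
        System.out.println(textWithEntityRefs(xml));
    }
}
```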

If you supply a LexicalHandler as a callback to the SAX parser, it will inform you of the start and end of every entity reference using the startEntity() and endEntity() callbacks.

(Note that the JavaDoc at http://download.oracle.com/javase/1.5.0/docs/api/org/xml/sax/ext/LexicalHandler.html talks of "entities" when the correct term is "entity references".)

Note also that there is no way to get a SAX parser to tell you about numeric character references such as &#x1234;. Applications are supposed to treat these in exactly the same way as the original character, so you really shouldn't be interested in them.
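To illustrate both points, here is a small sketch that registers a LexicalHandler through the standard lexical-handler property and records the names passed to startEntity(). The class name EntityEventLogger, the helper method and the entity foo are assumptions made for the example. It relies on DefaultHandler2, which implements both ContentHandler and LexicalHandler. With the parser bundled in the JDK, the declared entity should be reported, while the numeric character reference is expected to arrive only as ordinary character data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class, not part of the original answer.
class EntityEventLogger {

    // Returns the names passed to startEntity() while parsing the given XML.
    static List<String> entityNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();

        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                names.add(name);
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The same object serves as content handler and lexical handler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // One declared entity reference and one numeric character reference.
        String xml = "<!DOCTYPE r [<!ENTITY foo \"bar\">]><r>&foo; &#x41;</r>";
        System.out.println(entityNames(xml));
    }
}
```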

The temporary solution:

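// State shared between the LexicalHandler and ContentHandler callbacks
private boolean inEntity = false;
private String entityName;

public void startEntity(String name) throws SAXException {
    inEntity = true;
    entityName = name;
}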

public void characters(char[] ch, int start, int length) throws SAXException {
    String data;
    if (inEntity) {
        inEntity = false;
        data = "&" + entityName + ";";
    } else {
        data = new String(ch, start, length);
    }
    //TODO do something instead of System.out
    System.out.println(data);
}

But I still need an elegant solution.

There is one more way: the escapeXml method of the org.apache.commons.lang.StringEscapeUtils class.

Try this code in your characters(char[] ch, int start, int length) method:

String data=new String(ch, start, length);
String escapedData=org.apache.commons.lang.StringEscapeUtils.escapeXml(data);

You may download the jar here.
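If adding the library is not an option, a hand-written substitute is short. The sketch below is not the Commons Lang implementation. It is a hypothetical helper (XmlTextEscaper.escape) that covers only the five predefined XML entities. Keep in mind that any re-escaping approach works on the already expanded text, so it cannot restore references to entities declared in a DTD.

```java
// Hypothetical helper: a minimal stand-in for an XML escaping utility.
class XmlTextEscaper {

    // Replaces the five characters that have predefined XML entities.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a & b < c"));
    }
}
```

Inside characters(char[] ch, int start, int length) it would be called as XmlTextEscaper.escape(new String(ch, start, length)).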
