
Special characters in Text node not getting parsed by SAX's characters() method

I'm making an Android application in which I parse XML with a SAX parser.

The XML contains this tag:

<title>Deals &amp; Dealmakers: Technology, media and communications M&amp;A </title>

As you can see, it contains some special characters such as &amp;

The issue is with the SAX handler's callback method that I'm overriding:

@Override
public void characters(char[] ch, int start, int length) throws SAXException{}

Here, the parameter 'char[] ch' is supposed to fetch the entire line Deals &amp; Dealmakers: Technology, media and communications M&amp;A, but it is only getting "Deals ".

How can I solve this issue?

One cause might be the way I'm passing the XML to the SAX parser. Do I need to change the encoding or format?

Currently, I'm passing the XML as an InputStream, using the code below:

HttpResponse httpResponse = utils.sendRequestAndGetHTTPResponse(URL);
if (httpResponse.getStatusLine().getStatusCode() == 200) {
    HttpEntity entity = httpResponse.getEntity();
    InputStream in = entity.getContent();
    parseResponse(in);
}


// Inside parseResponse method:
try {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    SAXParser sp = spf.newSAXParser();
    XMLReader xmlReader = sp.getXMLReader();

    MyHandler handler = new MyHandler();
    xmlReader.setContentHandler(handler);
    xmlReader.parse(new InputSource(in));
} catch (Exception e) {
}

Here, the parameter 'char[] ch' is supposed to fetch the entire line Deals & Dealmakers: Technology, media and communications M&A But it is only getting "Deals ".

You seem to be assuming that you'll get the whole text in one call. There's no guarantee of that. I strongly suspect that your characters method will be called multiple times for the same text node, which is valid for the parser to do. You need to make sure your code handles that.

From the documentation:

SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

There may be a feature you can set to ensure you get all the data in one go; I'm not sure.
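
One way to handle the chunked delivery is to buffer the text between startElement and endElement, and only read it once the element closes. The following is a minimal sketch of that idea, not the asker's actual MyHandler: the class name TitleHandler, the parseTitles helper, and the choice to collect only <title> elements are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <title> element.
class TitleHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> titles = new ArrayList<>();
    private boolean inTitle;

    // Depending on namespace settings, the name arrives in localName or qName.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("title".equals(nameOf(localName, qName))) {
            inTitle = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so append rather than assign.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(nameOf(localName, qName))) {
            inTitle = false;
            titles.add(text.toString()); // the text is complete only here
        }
    }

    List<String> getTitles() {
        return titles;
    }

    // Hypothetical convenience method: parses an XML string and returns the titles.
    static List<String> parseTitles(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTitles();
    }
}
```

With a handler along these lines, the parser is free to split the text around each &amp; reference, and the complete title is still available in endElement.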

I guess UTF-8 is exactly the problem. In the file you are parsing, the encoding is declared as ISO-8859-1,

so just try the following code:

InputSource is = new InputSource(yourInputStream);
is.setEncoding("ISO-8859-1");
xmlReader.parse(is);

Hope this helps.
