XML parsing problem

I'm getting some text from an XML file:

URL url_Twitter = new URL("http://twitter.com/statuses/user_timelineID_PROVA.rss");
HttpURLConnection conn_Twitter = (HttpURLConnection) url_Twitter.openConnection();

DocumentBuilderFactory documentBF_Twitter = DocumentBuilderFactory.newInstance();
DocumentBuilder documentB_Twitter = documentBF_Twitter.newDocumentBuilder();
Document document_Twitter = documentB_Twitter.parse(conn_Twitter.getInputStream());

In the XML there are some character references like &#8217;, so when I call

document_Twitter.getElementsByTagName("title").item(2).getFirstChild().getNodeValue()

the string is truncated just before that kind of character.

The text is in just one tag:

  <item>
    <title>SMWRME: Internet per &#8220;Collaborare senza confini&#8221;. Soprattutto alla SMW di Roma, dal 7 all'11 febbraio. Ecco il terzo percorso. http://cot.ag/ewnJ4F</title>
    <description>SMWRME: Internet per &#8220;Collaborare senza confini&#8221;. Soprattutto alla SMW di Roma, dal 7 all'11 febbraio. Ecco il terzo percorso. http://cot.ag/ewnJ4F</description>
    <pubDate>Mon, 27 Dec 2010 20:05:01 +0000</pubDate>
    <guid>http://twitter.com/SMWRME/statuses/19483914259140609</guid>
    <link>http://twitter.com/SMWRME/statuses/19483914259140609</link>
    <twitter:source>&lt;a href=&quot;http://cotweet.com/?utm_source=sp1&quot; rel=&quot;nofollow&quot;&gt;CoTweet&lt;/a&gt;</twitter:source>
    <twitter:place/>
  </item>

I noticed that this behavior happens only in the Android application. The same code works fine in a plain Java application. Can someone help me?

Can you try document_Twitter.getElementsByTagName("title").item(2).getTextContent() instead? There might actually be multiple text nodes beneath this node, like:

- "item" element
  - "title" element
    - text node "SMWRME: Internet per "
    - text node "&#8220;"
    - text node "Collaborare senza confini"
    - text node "&#8221;"

Most SAX parsers deliver character content split into multiple parts, so I can imagine a DOM parser doing the same. The getTextContent method should return the text content of all descendant nodes concatenated.
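If getTextContent is not an option, the same concatenation can be done by hand. The following is only a minimal sketch of that idea, not a confirmed fix for the Android behavior: the class name TitleText and the helpers collectText and parse are made up for this illustration, and the sample XML is a shortened version of the title from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class TitleText {

    // Hypothetical helper: concatenates the values of all text-like
    // descendants of a node, so the result does not depend on how the
    // parser split the character data into separate nodes.
    public static String collectText(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE || type == Node.ENTITY_REFERENCE_NODE) {
                // Descend into nested elements and unexpanded entity references.
                sb.append(collectText(child));
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: parses an XML string with the default DOM parser.
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><title>Internet per &#8220;Collaborare senza confini&#8221;.</title></item>";
        Node title = parse(xml).getElementsByTagName("title").item(0);

        // Both lines should print the full title, however the parser split it.
        System.out.println(collectText(title));
        System.out.println(title.getTextContent());
    }
}
```

In the code from the question, this would mean replacing getFirstChild().getNodeValue() with a call like collectText(document_Twitter.getElementsByTagName("title").item(2)).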

You could also try calling setCoalescing(true) on your DocumentBuilderFactory before creating the DocumentBuilder. The documentation mentions that this affects CDATA sections, but it might also change the handling of character entities.
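A rough sketch of the coalescing setup is shown below. The class name CoalescingParse and the helper parseCoalescing are invented for this example, and the sample input uses a CDATA section because that is the case the documentation describes. Whether the setting also changes how a given Android parser splits text around character references is an assumption that would need to be tested on the device.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class CoalescingParse {

    // Hypothetical helper: builds a DOM with coalescing enabled, which asks
    // the parser to convert CDATA sections to text nodes and merge them
    // with adjacent text.
    public static Document parseCoalescing(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(true); // must be set before newDocumentBuilder()
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(in);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><title>a <![CDATA[b]]> c</title></item>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        Node title = parseCoalescing(in).getElementsByTagName("title").item(0);

        // Print how many child nodes the title ended up with, and its text.
        System.out.println(title.getChildNodes().getLength());
        System.out.println(title.getTextContent());
    }
}
```

With the code from the question, the equivalent change would be a single documentBF_Twitter.setCoalescing(true) call placed before newDocumentBuilder().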
