
Parsing XML, no string returned if CDATA does not contain an HTML tag

I'm using a DOM parser to read an RSS feed in Android, for example:

...
<item cbc:type="story" cbc:deptid="2.663" cbc:syndicate="true">
<title>
<![CDATA[
Asian carp have reproduced in Great Lakes watershed
]]>
</title>
<link>
http://www.cbc.ca/news/canada/windsor/asian-carp-have-reproduced-in-great-lakes-watershed-1.2286554?cmp=rss
</link>
<guid isPermaLink="false">1.2286554</guid>
<pubDate>Tue, 29 Oct 2013 08:06:48 EDT</pubDate>
<description>
<![CDATA[
<img title='Fisheries and Oceans Canada and the Ontario Ministry of Natural Resources confirmed one grass carp was caught in the Grand River near Lake Erie. ' height='259' alt='hi-20130502-grass_carp-dfo-852' width='460' src='http://i.cbc.ca/1.1663916.1379078358!/httpImage/image.jpg_gen/derivatives/16x9_460/hi-20130502-grass-carp-dfo-852.jpg' /> <p>Scientists said Monday they have documented for the first time that an Asian carp species has successfully reproduced within the Great Lakes watershed, an ominous development in the struggle to slam the door on the hungry invaders that could threaten native fish.</p>
]]>
</description>
</item>
...

xmlParser.class:

import android.util.Log;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class xmlParser {

    // Parses the RSS file into a DOM Document; returns null if parsing fails.
    public Document getDomElement(String rssFilePath, String fileName) {
        Log.d("GET", rssFilePath + fileName);
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setCoalescing(true);
        File file = new File(rssFilePath, "/" + fileName);
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(file)) {
            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource is = new InputSource();
            is.setByteStream(fis);
            return db.parse(is);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Pass the exception itself; e.getMessage() may be null
            Log.e("xmlParser", "Failed to parse " + file, e);
            return null;
        }
    }

    public String getValue(Element item, String str) {
        NodeList n = item.getElementsByTagName(str);
        return this.getElementValue(n.item(0));
    }

    public final String getElementValue(Node elem) {
        Node child;
        if (elem != null) {
            if (elem.hasChildNodes()) {
                for (child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        return child.getNodeValue();
                    }
                }
            }
        }
        return "";
    }
}

From my main activity:

//Parse the XML content
            xmlParser parser = new xmlParser();
            Log.d(TAG, "1");
            Document rssDoc = parser.getDomElement(rssFilePath, rssFileName);
            Log.d(TAG, "2");
            final NodeList nl = rssDoc.getElementsByTagName(KEY_ITEM);
            Log.d(TAG, "3");

            //Make it all look nice and strip HTML
            for (int i = 0; i < nl.getLength(); i++){

                Element e = (Element) nl.item(i);

                String noHtmlTitle = parser.getValue(e, KEY_TITLE).toString().replaceAll("\\<.*?>", "");
                noHtmlTitle = noHtmlTitle.replaceAll("\n", "");

                noHtmlTitle = noHtmlTitle.trim();

                titles.add(noHtmlTitle);

                String noHtmlDesc = parser.getValue(e, KEY_DESC).toString().replaceAll("\\<.*?>", "");
                noHtmlDesc = noHtmlDesc.trim(); 
                descs.add("\n" + noHtmlDesc);

            }

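However, when this code is given the <title></title> element shown above, it returns a blank string. This seems to be related to the fact that the title element does not contain any HTML tags.

How can I retrieve a usable string from the title element?

If I have left out any data you need, please let me know.

Edit:

As blahdiblah pointed out, the node type being returned is CDATA_SECTION_NODE. I modified the getElementValue method to handle this node type as well: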

public final String getElementValue(Node elem) {
    Node child;
    if (elem != null) {
        if (elem.hasChildNodes()) {
            for (child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child.getNodeType() == Node.TEXT_NODE) {
                    return child.getNodeValue();
                } else if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                    return child.getNodeValue();
                }
            }
        }
    }
    return "";
}

Your XMLParser only returns the contents of text nodes (child.getNodeType() == Node.TEXT_NODE), but the content of <title> is of type CDATA_SECTION_NODE.
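One way to confirm which node type a parser actually produces is to inspect the first child of each element. The following is a minimal sketch that runs on a plain JDK rather than on Android; the class name CdataNodeTypes and its helpers are hypothetical, and it leaves coalescing at its default (off), so the result may differ from a factory configured with setCoalescing(true). It also shows Node.getTextContent(), a standard DOM Level 3 method that returns the text of an element whether it is stored as text or as CDATA.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class CdataNodeTypes {

    // Parses an XML string into a DOM Document (coalescing left at its default, false).
    static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the node type of the first child of the first element named tag.
    static short firstChildType(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getFirstChild().getNodeType();
    }

    // Returns the text of the first element named tag, for text and CDATA alike.
    static String textOf(Document doc, String tag) {
        Node node = doc.getElementsByTagName(tag).item(0);
        return node == null ? "" : node.getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<item><title><![CDATA[Asian carp]]></title><guid>1.2286554</guid></item>");
        System.out.println(firstChildType(doc, "title") == Node.CDATA_SECTION_NODE);
        System.out.println(firstChildType(doc, "guid") == Node.TEXT_NODE);
        System.out.println(textOf(doc, "title"));
    }
}
```

If the first check reports a CDATA section for <title>, a method that only looks for TEXT_NODE children will skip the title content.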

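Note that the title is almost certainly sent as CDATA rather than plain text so that it can contain HTML formatting and other odd characters. Be sure to test with a variety of inputs to make sure you are parsing it correctly.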
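As an illustration of testing with varied inputs, here is a hedged sketch of a variant of getElementValue that concatenates every text and CDATA child instead of returning the first match. The motivation is an assumption about the feed: if the real XML has line breaks around the CDATA section, as in the listing in the question, the first child of <title> can be a whitespace-only text node, and returning the first match would then yield only whitespace. The class name ElementText and the valueOf helper are hypothetical, and the sketch runs on a plain JDK with default parser settings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class ElementText {

    // Concatenates every TEXT_NODE and CDATA_SECTION_NODE child of elem, then trims.
    static String getElementValue(Node elem) {
        if (elem == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Test helper: parses a one-element XML snippet and returns the root element's text.
    static String valueOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return getElementValue(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Plain text, bare CDATA, CDATA wrapped in line breaks, and CDATA holding HTML
        System.out.println(valueOf("<title>Plain text</title>"));
        System.out.println(valueOf("<title><![CDATA[Asian carp]]></title>"));
        System.out.println(valueOf("<title>\n<![CDATA[\nAsian carp\n]]>\n</title>"));
        System.out.println(valueOf("<title><![CDATA[<b>Carp</b> & more]]></title>"));
    }
}
```

Trimming inside the helper is a design choice made for this sketch; callers that need the surrounding whitespace preserved would drop the trim() call.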
