
Parsing XML, no string returned if CDATA does not contain an HTML tag

I am using a DOM parser to read RSS feeds such as this one on Android:

...
<item cbc:type="story" cbc:deptid="2.663" cbc:syndicate="true">
<title>
<![CDATA[
Asian carp have reproduced in Great Lakes watershed
]]>
</title>
<link>
http://www.cbc.ca/news/canada/windsor/asian-carp-have-reproduced-in-great-lakes-watershed-1.2286554?cmp=rss
</link>
<guid isPermaLink="false">1.2286554</guid>
<pubDate>Tue, 29 Oct 2013 08:06:48 EDT</pubDate>
<description>
<![CDATA[
<img title='Fisheries and Oceans Canada and the Ontario Ministry of Natural Resources confirmed one grass carp was caught in the Grand River near Lake Erie. ' height='259' alt='hi-20130502-grass_carp-dfo-852' width='460' src='http://i.cbc.ca/1.1663916.1379078358!/httpImage/image.jpg_gen/derivatives/16x9_460/hi-20130502-grass-carp-dfo-852.jpg' /> <p>Scientists said Monday they have documented for the first time that an Asian carp species has successfully reproduced within the Great Lakes watershed, an ominous development in the struggle to slam the door on the hungry invaders that could threaten native fish.</p>
]]>
</description>
</item>
...

xmlParser.class:

public class xmlParser {

public Document getDomElement(String rssFilePath, String fileName){
    Log.d("GET", ""+rssFilePath+fileName);
    Document doc = null;
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setCoalescing(true);
    FileInputStream fis;
    try {

        DocumentBuilder db = dbf.newDocumentBuilder();

        File tmp2 = new File (rssFilePath,"/"+ fileName);
        fis = new FileInputStream(tmp2);

        InputSource is = new InputSource();
        is.setByteStream(fis);
        doc = db.parse(is);
    } catch (ParserConfigurationException e) {
        Log.e("Error: ", e.getMessage());
        return null;
    } catch (SAXException e) {
        Log.e("Error: ", e.getMessage());
        return null;
    } catch (IOException e) {
        Log.e("Error: ", e.getMessage());
        return null;
    }
    // return DOM
    // Log.d("DOM", doc.toString());
    return doc;

}

public String getValue(Element item, String str) { 
    NodeList n = item.getElementsByTagName(str);        
    return this.getElementValue(n.item(0));
}

public final String getElementValue( Node elem ) {
         Node child;
         if( elem != null){
             if (elem.hasChildNodes()){
                 for( child = elem.getFirstChild(); child != null; child = child.getNextSibling() ){
                     if( child.getNodeType() == Node.TEXT_NODE  ){
                         return child.getNodeValue();
                     }
                 }
             }
         }
         return "";
  } 
}

From my main activity:

//Parse the XML content
            xmlParser parser = new xmlParser();
            Log.d(TAG, "1");
            Document rssDoc = parser.getDomElement(rssFilePath, rssFileName);
            Log.d(TAG, "2");
            final NodeList nl = rssDoc.getElementsByTagName(KEY_ITEM);
            Log.d(TAG, "3");

            //Make it all look nice and strip HTML
            for (int i = 0; i < nl.getLength(); i++){

                Element e = (Element) nl.item(i);

                String noHtmlTitle = parser.getValue(e, KEY_TITLE).toString().replaceAll("\\<.*?>", "");
                noHtmlTitle = noHtmlTitle.replaceAll("\n", "");

                noHtmlTitle = noHtmlTitle.trim();

                titles.add(noHtmlTitle);

                String noHtmlDesc = parser.getValue(e, KEY_DESC).toString().replaceAll("\\<.*?>", "");
                noHtmlDesc = noHtmlDesc.trim(); 
                descs.add("\n" + noHtmlDesc);

            }

However, when this code is given the <title>...</title> element above, it returns a blank string. This appears to be related to the fact that the title's CDATA section does not contain any HTML tags.

How can I retrieve a usable string from the title tags?

Let me know if I have excluded any required data.

Edit:

As per blahdiblah's answer, the node type being returned was CDATA_SECTION_NODE. I modified the getElementValue method to handle this node type as well:

public final String getElementValue( Node elem ) {
         Node child;
         if( elem != null){
             if (elem.hasChildNodes()){
                 for( child = elem.getFirstChild(); child != null; child = child.getNextSibling() ){
                     if( child.getNodeType() == Node.TEXT_NODE  ){
                         return child.getNodeValue();
                     }else if (child.getNodeType() == Node.CDATA_SECTION_NODE){
                         return child.getNodeValue();
                     }
                 }
             }
         }
         return "";
  } 

Your xmlParser only returns content for text nodes (child.getNodeType() == Node.TEXT_NODE), but the content of <title> is a CDATA_SECTION_NODE.
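
To see the difference in node types, here is a minimal sketch that parses two tiny documents and reports the type of the root element's first child. The class and method names (CdataNodeTypeDemo, firstChildType) are made up for illustration, and the sketch assumes a desktop JDK parser with coalescing turned off; a given Android parser may behave differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original question's code.
class CdataNodeTypeDemo {

    // Returns the DOM node type of the first child of the root element.
    static short firstChildType(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Assumption: coalescing is off, so CDATA sections stay separate nodes.
            dbf.setCoalescing(false);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getFirstChild().getNodeType();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        short cdata = firstChildType("<title><![CDATA[Asian carp]]></title>");
        short text = firstChildType("<title>Asian carp</title>");
        // Compare against the constants defined on org.w3c.dom.Node.
        System.out.println("CDATA title is CDATA_SECTION_NODE: " + (cdata == Node.CDATA_SECTION_NODE));
        System.out.println("Plain title is TEXT_NODE: " + (text == Node.TEXT_NODE));
    }
}
```

A check that only accepts Node.TEXT_NODE skips the first kind of child entirely, which is consistent with the blank string described in the question.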

Note that the title is almost certainly being sent as CDATA instead of plain text so that it can include HTML formatting and other odd characters. Test with a wide variety of input to make sure that you parse it correctly.
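
One way to be more tolerant of varied input is to concatenate every text and CDATA child instead of returning the first match. If the feed really contains line breaks around the CDATA section, as in the sample shown in the question, the first child may be a whitespace-only text node, and returning it would still yield a blank title after trimming. The following is only a sketch under those assumptions: TextCollector, collectText and titleOf are hypothetical names, and it has been reasoned through against a desktop JDK parser, not an Android one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, shown as an alternative to getElementValue.
class TextCollector {

    // Joins the values of all TEXT and CDATA children, then trims the result.
    static String collectText(Node elem) {
        if (elem == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Convenience wrapper for the demo: parses a string and reads the first <title>.
    static String titleOf(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return collectText(doc.getElementsByTagName("title").item(0));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // CDATA surrounded by line breaks, like the sample feed item.
        System.out.println(titleOf("<item><title>\n<![CDATA[\nAsian carp\n]]>\n</title></item>"));
        // Plain text mixed with a CDATA section that carries markup.
        System.out.println(titleOf("<item><title>A <![CDATA[<b>B</b>]]> C</title></item>"));
    }
}
```

Calling trim() once on the joined string, and not on each child, keeps any spacing between adjacent text and CDATA pieces intact.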
