
NASA RSS feed SAX parsing error

I am trying to write a Java program that reads a NASA RSS feed. The code works, but when it encounters the &#039;s sequence (an escaped apostrophe followed by s), it does not read the entire line. For example: "A new NASA study finds the last remaining section of Antarctica&#039;s Larsen B Ice Shelf, which partially collapsed in 2002, is quickly weakening and likely to disintegrate completely before the end of the decade". In this line, the code reads nothing after "Antarctica". What is wrong with the code, and how can I fix it? Without the &#039;s sequence the code works fine. The feed is at http://www.nasa.gov/rss/dyn/earth.rss

package xmlparseprac;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Handler extends DefaultHandler {
boolean mtitle=false;
boolean mdescription=false;
boolean mitem;

@Override
public void startDocument() throws SAXException {
    super.startDocument(); 
    System.out.println("Starting...");
}

@Override
public void endDocument() throws SAXException {
    super.endDocument(); 
    System.out.println("Ending...");
}

@Override
public void startElement(String string, String string1, String string2, Attributes atrbts) throws SAXException {
    super.startElement(string, string1, string2, atrbts); 
    if(string2.equalsIgnoreCase("item")){mitem=true;}
    if(string2.equalsIgnoreCase("title")){mtitle=true;}
    if(string2.equalsIgnoreCase("description")){mdescription=true;}
}

@Override
public void endElement(String string, String string1, String string2) throws SAXException {
    super.endElement(string, string1, string2);
    if(string2.equalsIgnoreCase("item")){mitem=false;}
    if(string2.equalsIgnoreCase("title")){mtitle=false;}
    if(string2.equalsIgnoreCase("description")){mdescription=false;}
}

@Override
public void characters(char[] chars, int i, int i1) throws SAXException {
    super.characters(chars, i, i1);
    if(mtitle==true && mitem==true){
        String s=new String(chars, i, i1);
        System.out.println("Title:"+s);
        mtitle=false;}
    if(mdescription==true && mitem==true){
        String s=new String(chars, i, i1);
        System.out.println("Description:"+s);
        mdescription=false;
    }
}

}

I finally found the answer to my question.

Links:
http://www.javaexperience.com/strip-invalid-characters-from-xml/
https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringEscapeUtils.html

The Apache Commons Lang class StringEscapeUtils has a method called unescapeHtml4. It replaces HTML escape sequences such as &#039; with the characters they represent (here, an apostrophe). Read the URL's InputStream into a String, call unescapeHtml4 on that String, wrap the result in a new InputStream, and pass that InputStream to the parse method. Thanks to @duffymo for telling me about the "magic characters".
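
An alternative that needs no pre-processing of the feed: the SAX ContentHandler contract allows a parser to deliver the text of one element through several characters() calls, and parsers often split the text at a character reference such as &#039;. The handler in the question clears mtitle/mdescription after the first call, so any later chunk is dropped. Below is a minimal sketch of a buffering handler, assuming it is enough to collect text until endElement. The names BufferingHandler, parse and getLines, and the sample XML, are made up for illustration and are not from the original post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical replacement for the Handler class in the question.
class BufferingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder(); // text of the current element
    private final List<String> lines = new ArrayList<>();   // collected output lines
    private boolean inItem;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equalsIgnoreCase("item")) {
            inItem = true;
        }
        // Fresh buffer for every element; fine here because title and
        // description are assumed to contain text only, no child elements.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than print.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inItem && qName.equalsIgnoreCase("title")) {
            lines.add("Title:" + text.toString().trim());
        } else if (inItem && qName.equalsIgnoreCase("description")) {
            lines.add("Description:" + text.toString().trim());
        } else if (qName.equalsIgnoreCase("item")) {
            inItem = false;
        }
        text.setLength(0);
    }

    List<String> getLines() {
        return lines;
    }

    // Demo helper: parses an XML string and returns the collected lines.
    static List<String> parse(String xml) {
        try {
            BufferingHandler handler = new BufferingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getLines();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample shaped like one item of the feed.
        String xml = "<rss><channel><title>Feed</title><item>"
                + "<title>Larsen B</title>"
                + "<description>Antarctica&#039;s Larsen B Ice Shelf</description>"
                + "</item></channel></rss>";
        parse(xml).forEach(System.out::println);
    }
}
```

With this handler the description is emitted only once the closing tag is seen, so the text after the apostrophe is kept however the parser chunks it. For the real feed, the same handler could be passed to SAXParser.parse together with the URL's InputStream.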

