
Sax parser encoding in Java

I have a problem with the SAX parser and encoded text. I am trying to parse an RSS feed encoded in ISO-8859-2 (http://www.sbazar.cz/rss.xml?keyword=pes) this way:

InputStream responseStream = connection.getInputStream();
Response response = mRequest.createResponse();

Reader reader = new InputStreamReader(responseStream);
InputSource is = new InputSource(reader);
is.setEncoding("ISO-8859-2");

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
saxParser.parse(is, response);

but the parser returns strings with strange symbols. I have tried a lot of things, but nothing has helped. Can somebody help me, please?


Have you tried setting the charset of the InputStreamReader?

Reader reader = new InputStreamReader(responseStream, Charset.forName("ISO-8859-2"));
InputSource is = new InputSource(reader);

If you don't specify a charset, the InputStreamReader(InputStream) constructor uses the platform default charset (windows-1252 on my machine).

So in your current setup the bytes are probably being decoded as windows-1252 characters, and I don't think you can re-interpret them as ISO-8859-2 after that.
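
To make the effect visible without a network connection, here is a minimal sketch of the same mismatch. The class name, the sample text and the use of ISO-8859-1 as the "wrong" charset are my own assumptions for illustration (ISO-8859-1 stands in for whatever the platform default happens to be, such as windows-1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class CharsetMismatchDemo {
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // Sample Czech text written with Unicode escapes so the source file
    // encoding does not matter.
    static final String ORIGINAL = "\u017Elu\u0165ou\u010Dk\u00FD pes";

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // These are the bytes a server would send for an ISO-8859-2 feed.
        byte[] wire = ORIGINAL.getBytes(LATIN2);

        // Assumption: ISO-8859-1 plays the role of the wrong default charset.
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);
        String right = decode(wire, LATIN2);

        System.out.println("wrong charset matches original: " + wrong.equals(ORIGINAL));
        System.out.println("right charset matches original: " + right.equals(ORIGINAL));
    }
}
```

The same bytes give different strings depending only on the charset handed to the decoder, which is why the charset has to be passed to the InputStreamReader explicitly.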

SAX can autodetect the encoding if it is given an input stream rather than a reader.

InputSource is = new InputSource(responseStream);

In your case you probably wanted a hardcoded encoding, and you already have an answer showing how to do that. I was looking for a general solution, though, and found one here: Howto let the SAX parser determine the encoding from the xml declaration?

Documentation: InputSource in Java 5 (note that the Java 1.4 documentation lacks the crucial sentence). It says that, when no encoding is specified, the parser falls back to "autodetecting the character encoding using an algorithm such as the one in the XML specification". That applies to a byte stream, not to a character stream (Reader).

Digging further into the XML specification (Autodetection of Character Encodings), I found an explanation of why a Reader and a stream are treated differently. To apply the full detection algorithm, SAX must have access to the raw byte stream rather than one already converted to characters, because the conversion could corrupt the byte markers.
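
The difference between the two kinds of InputSource can be sketched as follows. This is only an illustration under my own assumptions: the class name, the tiny in-memory document and the helper methods are hypothetical, and ISO-8859-1 again stands in for a wrong default charset. It uses only the standard JAXP and SAX classes that ship with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original answers.
class SaxEncodingDemo {
    // Sample text, written with escapes so the source encoding is irrelevant.
    static final String EXPECTED = "\u017Elu\u0165ou\u010Dk\u00FD pes";

    // A tiny document that declares its encoding and is encoded accordingly.
    static byte[] sampleFeed() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?>"
                + "<title>" + EXPECTED + "</title>";
        return xml.getBytes(Charset.forName("ISO-8859-2"));
    }

    // Parse the source and return all character data it contains.
    // Checked exceptions are wrapped to keep the sketch short.
    static String parseText(InputSource source) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(source, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return text.toString();
    }

    // Byte stream: the parser reads the XML declaration and picks the charset.
    static String parseFromBytes() {
        return parseText(new InputSource(new ByteArrayInputStream(sampleFeed())));
    }

    // Character stream decoded with the wrong charset: the damage is already
    // done before the parser sees the data, and setEncoding has no effect
    // when a character stream is supplied.
    static String parseFromWrongReader() {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(sampleFeed()), StandardCharsets.ISO_8859_1);
        InputSource source = new InputSource(reader);
        source.setEncoding("ISO-8859-2");
        return parseText(source);
    }

    public static void main(String[] args) {
        System.out.println("byte stream correct:  " + parseFromBytes().equals(EXPECTED));
        System.out.println("wrong reader correct: " + parseFromWrongReader().equals(EXPECTED));
    }
}
```

Handing the parser the raw stream keeps the decision about the charset in one place, the XML declaration, so the same code should keep working if the feed later switches to another encoding.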

Finally, I solved my problem using the Rome library. It also works well with ISO-8859-2. Here is the source code showing how to use Rome:

String urlstring = "http://www.sbazar.cz/rss.xml?keyword=pes";
InputStream is = new URL(urlstring).openConnection().getInputStream();
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = (SyndFeed)input.build(new InputStreamReader(is, Charset.forName("ISO-8859-2")));

Iterator entries = feed.getEntries().iterator();
while (entries.hasNext())
{
    SyndEntry entry = (SyndEntry)entries.next();
    Log.d("RSS", "-------------");
    Log.d("RSS", "Title: " + entry.getTitle());
    Log.d("RSS", "Published: " + entry.getPublishedDate());

    if (entry.getDescription() != null) 
    {
        Log.d("RSS", "Description: " + entry.getDescription().getValue());
    }
    if (entry.getContents().size() > 0) 
    {
        SyndContent content = (SyndContent)entry.getContents().get(0);
        Log.d("RSS", "Content type=" + content.getType());
        Log.d("RSS", "Content value=" + content.getValue());
    }
} 

