I have a problem with the SAX parser and encoded text. I am trying to parse an RSS feed encoded in ISO-8859-2 ( http://www.sbazar.cz/rss.xml?keyword=pes ) this way:
InputStream responseStream = connection.getInputStream();
Response response = mRequest.createResponse();
Reader reader = new InputStreamReader(responseStream);
InputSource is = new InputSource(reader);
is.setEncoding("ISO-8859-2");
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
saxParser.parse(is, response);
but the parser returns strings containing strange symbols. I have tried a lot of things, but nothing helped. Can somebody help me, please?
Have you tried setting the charset of the InputStreamReader:
Reader reader = new InputStreamReader(responseStream, Charset.forName("ISO-8859-2"));
InputSource is = new InputSource(reader);
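If you don't specify a charset, the InputStreamReader(InputStream) constructor uses the platform default charset (which on my machine is windows-1252).
So in your current setup, the bytes are (probably) being decoded as windows-1252 characters, and I don't think you can reliably re-interpret them as ISO-8859-2 after that. Calling setEncoding on the InputSource does not help here, because the encoding setting is ignored once a character stream (Reader) has been supplied.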
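To illustrate the mismatch, here is a minimal sketch that encodes a Czech word as ISO-8859-2 bytes and then decodes those bytes with two different charsets. The class name CharsetMismatchDemo, the helper decode, and the sample word are made up for this example, and windows-1252 only stands in for whatever the default charset happens to be on a given machine.

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {

    // Decode raw bytes using the named charset (hypothetical helper).
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Sample Czech word written with Unicode escapes so the result
        // does not depend on the encoding of this source file.
        String original = "\u0161t\u011Bn\u011B";

        // These are the bytes a server would send for an ISO-8859-2 feed.
        byte[] bytes = original.getBytes(Charset.forName("ISO-8859-2"));

        // Decoding with the charset the bytes were written in.
        String right = decode(bytes, "ISO-8859-2");

        // Decoding with a different charset, as a Reader built without
        // an explicit charset might do.
        String wrong = decode(bytes, "windows-1252");

        System.out.println("ISO-8859-2 round trip matches: " + original.equals(right));
        System.out.println("windows-1252 decoding matches: " + original.equals(wrong));
    }
}
```

The same bytes yield different strings depending on the charset used to decode them, which is why the charset has to be chosen at the point where the Reader is created.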
SAX can autodetect the encoding if it is given an input stream rather than a reader:
InputSource is = new InputSource(responseStream);
In your case you probably wanted a hardcoded encoding, and you already have an answer on how to do that. But I was looking for a general solution and found one here: Howto let the SAX parser determine the encoding from the xml declaration?
Documentation: see InputSource in Java 5 (note that the Java 1.4 documentation lacks the crucial sentence). It says that when a byte stream is supplied without an encoding, the parser falls back to "autodetecting the character encoding using an algorithm such as the one in the XML specification". That applies to a byte stream, but not to a character stream (Reader).
As I dug further into the XML specification (Autodetection of Character Encodings), I found an explanation of why a Reader and a stream are treated differently. To apply the full detection algorithm, SAX must have access to the raw byte stream, not one that has already been converted to characters, because the conversion could corrupt the byte markers the algorithm relies on.
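The byte stream approach can be sketched as follows. This is only an illustration under stated assumptions: the class SaxEncodingDemo, the helper parseText, and the tiny in-memory document are invented for the example, and it relies on the JDK's bundled SAX parser honouring the encoding named in the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEncodingDemo {

    // Parse XML from raw bytes and return all character data found.
    // No Reader and no setEncoding call: the parser is left to work out
    // the encoding from the bytes themselves (hypothetical helper).
    static String parseText(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample text written with Unicode escapes to stay independent
        // of the encoding of this source file.
        String expected = "\u0161t\u011Bn\u011B";
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?>"
                + "<title>" + expected + "</title>";

        // Encode the document the way an ISO-8859-2 feed would arrive.
        byte[] bytes = xml.getBytes(Charset.forName("ISO-8859-2"));

        System.out.println("Parsed text matches: " + expected.equals(parseText(bytes)));
    }
}
```

In the question's code, the equivalent change would be to pass responseStream straight to the InputSource constructor instead of wrapping it in an InputStreamReader first.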
Finally, I solved my problem using the Rome library. It also works well with ISO-8859-2. Here is the source code showing how to use Rome:
String urlstring = "http://www.sbazar.cz/rss.xml?keyword=pes";
InputStream is = new URL(urlstring).openConnection().getInputStream();
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = (SyndFeed)input.build(new InputStreamReader(is, Charset.forName("ISO-8859-2")));
Iterator entries = feed.getEntries().iterator();
while (entries.hasNext())
{
    SyndEntry entry = (SyndEntry)entries.next();
    Log.d("RSS", "-------------");
    Log.d("RSS", "Title: " + entry.getTitle());
    Log.d("RSS", "Published: " + entry.getPublishedDate());
    if (entry.getDescription() != null)
    {
        Log.d("RSS", "Description: " + entry.getDescription().getValue());
    }
    if (entry.getContents().size() > 0)
    {
        SyndContent content = (SyndContent)entry.getContents().get(0);
        Log.d("RSS", "Content type=" + content.getType());
        Log.d("RSS", "Content value=" + content.getValue());
    }
}