UTF-8 & ISO-8859-1 not working for decoding European charset in Java

Question

Hi, I have an HTML page from which I am scraping data. The page uses the UTF-8 charset and contains German and other European letters:

<meta charset="utf-8">

But when I try to decode it as ISO-8859-1 or as UTF-8 in Java, nothing really works. Instead of the European characters, I get values like:

Bayern MÃ¼nchen
Bor. MÃ¶nchengladbach
JÃ©rÃ´me Boateng

Following is the piece of my code:

    URL myUrl = new URL("http://www.weltfussball.de/spielplan/bundesliga-"
            + season + "-spieltag/" + gameDay + "/");

    in = new BufferedReader(new InputStreamReader(myUrl.openStream(), "ISO-8859-1"));

    while ((line = in.readLine()) != null) {
        all += line;
    }

One thing I have noticed: when I print String line, all the Latin characters appear correctly on the Java console, but as soon as I concatenate it to String all, the characters get messed up. Can anyone suggest a solution?

Answer 1

First, check whether the page really uses UTF-8 as it claims to:

final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

try (
    final InputStream in = url.openStream();
    final Reader reader = new InputStreamReader(in, decoder);
) {
    /* read the contents */
}

If this program throws a MalformedInputException then you know the page is lying.

Given your output however, I suspect the problem is that your display does not read UTF-8 correctly.
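
For reference, the pattern in the question (MÃ¼nchen) is what UTF-8 bytes look like when they are decoded as ISO-8859-1. The following is a minimal sketch that works on an in-memory byte array instead of the real page, assuming Java 7 or later; the class and method names (CharsetCheck, isValid) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetCheck {

    // Hypothetical helper: true if the bytes decode cleanly in the given charset.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "München", written with an escape so the source file encoding does not matter.
        byte[] utf8 = "M\u00fcnchen".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "M\u00fcnchen".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValid(utf8, StandardCharsets.UTF_8));   // true
        System.out.println(isValid(latin1, StandardCharsets.UTF_8)); // false: a lone 0xFC is not valid UTF-8

        // Decoding UTF-8 bytes as ISO-8859-1 turns the two bytes of "ü" (0xC3 0xBC)
        // into two separate characters, U+00C3 and U+00BC.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("M\u00c3\u00bcnchen"));   // true
    }
}
```

Note that decoding as ISO-8859-1 never fails, since every byte maps to some character, so a strict check like this is only informative for UTF-8.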

Answer 2

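This always works.

// Caveat: available() is only an estimate and a single read() may return fewer bytes,
// so this pattern is safe only for small resources that are fully available at once.
InputStream is = getClass().getResourceAsStream(myUrl);
byte[] b = new byte[is.available()];
int l = is.read(b);
String body = new String(b, 0, l, "UTF-8"); // or whichever charset the content uses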
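
A variation on the same idea that does not depend on available(): read until the end of the stream, then decode all the bytes in one step. This is only a sketch, assuming Java 8 or later; StreamText and readAll are illustrative names, and the in-memory stream in main stands in for whatever stream you actually open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamText {

    // Hypothetical helper: collects every byte first and decodes once at the end,
    // so a multi-byte sequence is never split across two reads.
    static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // "Jérôme Boateng" as UTF-8 bytes, standing in for the downloaded page.
        byte[] page = "J\u00e9r\u00f4me Boateng".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(page)) {
            String body = readAll(in, StandardCharsets.UTF_8);
            System.out.println(body.equals("J\u00e9r\u00f4me Boateng")); // true
        }
    }
}
```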

Answer 3

Make sure that "ISO-8859-1" is the only encoding being used; otherwise it isn't going to work. I had the same problem today. I spent 30 minutes reading this article, http://www.joelonsoftware.com/articles/Unicode.html, and then I solved my problem. Now I know what an encoding is, why people use one, why it is useful, and what its limitations are.

To solve my problem, I only replaced this tag in my header template file:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

with:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

I reloaded the browser, and my European names with weird characters were now being displayed properly :)
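
If you want to read that declaration in Java rather than in a template, here is a rough sketch. It assumes the value has the simple "type/subtype; charset=NAME" shape of the content attribute above; ContentTypeCharset, parse, and the fallback parameter are hypothetical names, and real-world headers may need a more tolerant parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a content-type value,
    // or the fallback when none is declared or the name is not usable.
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(parse("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The resulting Charset can then be passed to InputStreamReader, so that the decoder matches what the page declares instead of a hard-coded name.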

Sorry for my bad English!

3 answers

solution1
0 2015-01-20 18:34:25

solution2
0 2015-01-22 14:43:08

solution3
0 2015-02-10 16:20:41
