Hi, I have an HTML page from which I am scraping data. The page uses the UTF-8 charset and contains German and other European letters:
<meta charset="utf-8">
But when I try to decode it in Java, either as ISO-8859-1 or as UTF-8, nothing really works. I'm not able to get the European characters; instead I get values like:
Bayern München
Bor. Mönchengladbach
Jérôme Boateng
Here is the relevant piece of my code:
URL myUrl = new URL("http://www.weltfussball.de/spielplan/bundesliga-"
+ season + "-spieltag/" + gameDay + "/");
in = new BufferedReader(new InputStreamReader(myUrl.openStream(), "ISO-8859-1"));
while ((line = in.readLine()) != null) {
all += line;
}
One thing I have noticed is that when I print the String line, all the Latin characters appear correctly on the Java console, but as soon as I concatenate it to the String all, the characters get garbled. Can anyone suggest a solution?
First, check whether the page really uses UTF-8 as it claims:
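import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
// Strict decoder: report malformed bytes instead of silently replacing them
final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);
try (
    final InputStream in = url.openStream();
    final Reader reader = new InputStreamReader(in, decoder);
) {
    /* read the contents */
}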
If this program throws a MalformedInputException
then you know the page is lying.
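The same check can be tried offline on a byte array. The following is only a minimal sketch: the class name StrictUtf8Check and the helper isValidUtf8 are made up for illustration, and the sample bytes are produced locally rather than downloaded from the page.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException
            return false;
        }
    }

    public static void main(String[] args) {
        // "M\u00FCnchen" is "München" written with a Unicode escape
        byte[] utf8 = "M\u00FCnchen".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "M\u00FCnchen".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: 0xFC is not valid UTF-8
    }
}
```

A false result for the real page's bytes would indicate that the declared charset does not match the content.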
Given your output however, I suspect the problem is that your display does not read UTF-8 correctly.
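If the display is the culprit, one thing to try is printing through a PrintStream that encodes explicitly as UTF-8 instead of relying on the platform default. This is a sketch under assumptions: Utf8Console and utf8Stream are hypothetical names, and the console or IDE itself must also be configured to interpret UTF-8 for the output to look right.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {
    /** Wraps a stream so that text written to it is always encoded as UTF-8. */
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream console = utf8Stream(System.out);
        console.println("Bayern M\u00FCnchen");      // Bayern München
        console.println("J\u00E9r\u00F4me Boateng"); // Jérôme Boateng
    }
}
```

If the names look right through this stream but wrong through plain System.out, the decoding step was fine and only the output encoding was off.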
This always works.
InputStream is = getClass().getResourceAsStream(myUrl);
// readAllBytes() (Java 9+) reads to the end of the stream; is.available() is only an estimate
byte[] b = is.readAllBytes();
String body = new String(b, StandardCharsets.UTF_8); // or whatever charset your data actually uses
Make sure that ISO-8859-1 is the only charset being used to read the page; otherwise it isn't going to work. I had the same problem today. I took 30 minutes to read this article, http://www.joelonsoftware.com/articles/Unicode.html, and then I solved my problem. Now I know what an encoding is, why people use one, why it is useful, and what its limitations are.
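Why the charsets have to match can be seen in a few lines. This is just an illustrative sketch; CharsetMismatchDemo and its decode helper are hypothetical names, and the sample string stands in for the page content.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    /** Decodes raw bytes with the given charset (hypothetical helper for illustration). */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "München" takes 8 bytes in UTF-8 because "ü" is encoded as two bytes (0xC3 0xBC)
        byte[] utf8Bytes = "M\u00FCnchen".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);      // matching charset
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1); // mismatched charset

        System.out.println(right.length()); // 7
        System.out.println(wrong.length()); // 8: each byte of "ü" became its own character
    }
}
```

The mismatched decode turns the two bytes of "ü" into two separate Latin-1 characters, which is where the typical garbage output comes from.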
To solve my problem, I only replaced this tag in my header template file:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
with:
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
After I reloaded the browser, my European names with weird characters were printed properly :)
Sorry for my bad English!