
Java InputStreamReader cannot read special (Turkish) characters

Below you can see my code:

final BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(f),"UTF-8"));// tried also "iso-8859-9"
String strLine;
while ((strLine = br.readLine()) != null) {
    total += "\n" + strLine;
}
br.close();

Here is the output. What should I do?

insan�n sec�ld�g� combobox

The � (U+FFFD) character is a special character defined by Unicode as the "replacement character": a character displayed in place of one that is not recognized, or when the byte data is malformed and a character cannot be decoded.

The InputStreamReader constructor you are using does not let you specify what happens when the data is malformed or a character is not recognized. It assumes you want the default behavior of substituting the "replacement character" in those cases, so that may be what you're seeing.
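As a rough illustration of that default behavior, the sketch below decodes a few bytes as UTF-8 through the same kind of constructor used in the question. The class name ReplacementDemo, the helper readDefault, and the sample bytes are made up for this example; the bytes assume the text was saved as ISO-8859-9, where the dotless ı is the single byte 0xFD, which is not valid UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ReplacementDemo {

    // Hypothetical helper: reads the first line using the charset-name
    // constructor, which silently substitutes U+FFFD for bad input.
    static String readDefault(InputStream in) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample: "insanın" as ISO-8859-9 bytes (0xFD is the dotless i).
        byte[] latin5 = {'i', 'n', 's', 'a', 'n', (byte) 0xFD, 'n'};
        String s = readDefault(new ByteArrayInputStream(latin5));
        // No exception is thrown; the bad byte became U+FFFD at index 5.
        System.out.println(s.indexOf('\uFFFD')); // prints 5
    }
}
```

The read succeeds without any error, which is why a wrong charset guess can go unnoticed until the text is displayed.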

If you examine your output and find that your Turkish characters are not there but have been replaced by the "replacement character" U+FFFD, you can change the behavior to throw an exception instead of using the replacement character -- an actual exception will make it easier to detect when data is in the wrong character set.

To specify this different behavior, use this version of the InputStreamReader constructor:

public InputStreamReader(InputStream in, CharsetDecoder dec)

For the CharsetDecoder, pass in:

charset.newDecoder().onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)

where charset is your character set of choice, e.g. StandardCharsets.UTF_8.

That will cause an exception (a CharacterCodingException subclass such as MalformedInputException) to be thrown instead of the replacement character being inserted.
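Putting those pieces together, a minimal sketch of a strict read might look like the following. The class name StrictDecodeDemo and the helper readStrict are hypothetical, and the sample bytes again assume ISO-8859-9 data being read as UTF-8; with a real file you would pass a FileInputStream instead of the in-memory stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {

    // Hypothetical helper: reads every line, throwing instead of
    // substituting U+FFFD when the bytes do not fit the charset.
    static String readStrict(InputStream in, Charset charset) throws IOException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder total = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, decoder))) {
            String line;
            while ((line = br.readLine()) != null) {
                total.append(line).append('\n');
            }
        }
        return total.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample: "insanın" as ISO-8859-9 bytes (0xFD is the dotless i).
        byte[] latin5 = {'i', 'n', 's', 'a', 'n', (byte) 0xFD, 'n'};
        try {
            readStrict(new ByteArrayInputStream(latin5), StandardCharsets.UTF_8);
            System.out.println("decoded without errors");
        } catch (CharacterCodingException e) {
            // Reached here: the bytes are not valid UTF-8.
            System.out.println("wrong charset: " + e);
        }
    }
}
```

Catching CharacterCodingException separately from other IOExceptions makes it possible to tell "wrong encoding" apart from ordinary I/O failures.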

If you still see the replacement character and no exception is thrown, it's fairly clear that the problem is in how you are viewing the output.

So what's the actual file encoding? Open the file in a hex editor and look at the byte values for insan�n (especially the broken character). Once you have the byte value, you can work out the actual encoding. So far you've just tried two encodings at random.
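If no hex editor is at hand, a few lines of Java can stand in for one. The sketch below is only an illustration: the class name ByteInspector and the helper toHex are invented here. As a reference point, the dotless ı (U+0131) is the two bytes C4 B1 in UTF-8 but the single byte FD in ISO-8859-9 and windows-1254, so the bytes found at the broken position hint at the real encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteInspector {

    // Hypothetical helper: formats bytes as space-separated two-digit hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Dump the raw bytes of the file named on the command line.
            System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
        } else {
            // No file given: show how the dotless i (U+0131) looks in UTF-8.
            System.out.println(toHex("\u0131".getBytes(StandardCharsets.UTF_8))); // C4 B1
        }
    }
}
```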
