Error reading UTF-8 file in Java

I am trying to read some sentences from a file that contains Unicode characters. It does print out a string, but for some reason it messes up the Unicode characters.

This is the code I have:

public static String readSentence(String resourceName) {

    String sentence = null;
    // try-with-resources closes the stream and the reader on exit
    try (InputStream refStream = ClassLoader
                 .getSystemResourceAsStream(resourceName);
         BufferedReader br = new BufferedReader(new InputStreamReader(
                 refStream, Charset.forName("UTF-8")))) {
        sentence = br.readLine();
    } catch (IOException e) {
        // keep the original exception as the cause
        throw new RuntimeException("Cannot read sentence: " + resourceName, e);
    }
    return sentence.trim();
}

The problem is probably in the way that the string is being output.

I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:

for (char ch : sentence.toCharArray()) {
    System.err.println("char '" + ch + "' is unicode codepoint " + ((int) ch));
}

and see if the Unicode code points are correct for the characters that are being messed up. If they are correct, then the problem is on the output side; if not, it is on the input side.
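A variation on the same check, as a rough sketch: the loop over `toCharArray()` shows UTF-16 code units, so a character outside the Basic Multilingual Plane shows up as two surrogate values. Iterating with `String.codePoints()` (Java 8+) reports whole code points instead. The class name `CodePointDump` and the helper `describe` are made up for this illustration.

```java
import java.util.stream.Collectors;

public class CodePointDump {

    // Hypothetical helper: renders every code point of s as U+XXXX,
    // separated by single spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The accented letter is written as an escape so that the encoding
        // of this source file cannot interfere with the check.
        System.err.println(describe("caf\u00e9"));
    }
}
```

The output is plain ASCII, so it stays readable even if the console encoding is the thing that is broken.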

First, you could create the InputStreamReader as

new InputStreamReader(refStream, "UTF-8")

Also, you should verify that the resource really contains UTF-8 content.
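One possible way to do that check, as a sketch only: decode the raw bytes with a strict `CharsetDecoder` that reports malformed input instead of silently replacing it. The class name `Utf8Check` and the method `isValidUtf8` are invented for this example, and `InputStream.readAllBytes()` assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Hypothetical helper: true if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not legal UTF-8.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the resource name from the question.
        try (InputStream in = ClassLoader.getSystemResourceAsStream(args[0])) {
            if (in == null) {
                System.err.println("resource not found: " + args[0]);
                return;
            }
            System.err.println("valid UTF-8: " + isValidUtf8(in.readAllBytes()));
        }
    }
}
```

A `false` result means the file is definitely not UTF-8. A `true` result is weaker evidence: pure ASCII passes too, so it only shows that the bytes are consistent with UTF-8.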

One of the most annoying causes could be... your IDE settings.

If your IDE's default console encoding is something like latin1, you can struggle for a long time with different variations of the Java code, and nothing will help until you set the IDE's encoding options correctly.
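To see what the JVM itself assumes on the output side, something along these lines may help. It is a sketch, not a fix for every setup: it prints the default charset and writes through a `PrintStream` that encodes as UTF-8 explicitly. The names `Utf8Console` and `utf8Stream` are hypothetical, and the console that displays the bytes still has to be set to UTF-8 in the IDE.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class Utf8Console {

    // Hypothetical helper: wraps a stream in an auto-flushing PrintStream
    // that always encodes text as UTF-8, whatever the platform default is.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // If this is not UTF-8, plain System.out may garble non-ASCII text.
        System.err.println("default charset: " + Charset.defaultCharset());

        PrintStream out = utf8Stream(System.out);
        out.println("caf\u00e9");
    }
}
```

If the text looks right through this stream but wrong through plain `System.out`, that points at the default encoding rather than at the reading code.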
