
Java's charsets / character encoding

I have a file in Spanish so it's full of characters like:

 á é í ó ú ñ Ñ Á É Í Ó Ú 

I have to read the file, so I do this:

fr = new FileReader(ficheroEntrada);
BufferedReader rEntrada = new BufferedReader(fr);

String linea = rEntrada.readLine();
if (linea == null) {
    logger.error("ERROR: Empty file.");
    return null;
}
String delimitador = "[;]";
String[] tokens = null;

List<String> token = new ArrayList<String>();
while ((linea = rEntrada.readLine()) != null) {
    // Some parsing specific to my file. 
    tokens = linea.split(delimitador);
    token.add(tokens[0]);
    token.add(tokens[1]);
}
logger.info("List of tokens: " + token);
return token;

When I read the list of tokens, all the special characters are gone and have been replaced by characters like these:

Ó = Ã“
Ñ = Ã‘

And so on...

What's happening? I have never had problems with charsets before (I'm assuming it is a charset issue). Is it because of this computer? What can I do?

Any extra advice will be appreciated, I'm learning! Thank you!

You need to specify the character encoding explicitly:

BufferedReader rEntrada = new BufferedReader(
    new InputStreamReader(new FileInputStream(ficheroEntrada), "UTF-8"));

What's happening?

The answers recommending reading and writing using UTF-8 encoding should fix your problem. My answer is more about what happened and how to diagnose similar problems in the future.

The first place to start is the UTF-8 character table at http://www.utf8-chartable.de . There is a drop-down on the page which lets you browse different portions of Unicode. One of your problem characters is Ó. Checking the chart reveals that if your file was encoded in UTF-8, then the character is U+00D3 LATIN CAPITAL LETTER O WITH ACUTE, and its UTF-8 encoding is the two-byte sequence hex C3 93.
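You can confirm that byte sequence from Java itself; a minimal sketch using only the standard library (the `\u00D3` escape is Ó, written that way so the result does not depend on the source file's own encoding):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        // "\u00D3" is Ó (U+00D3); encode it in UTF-8 and print the bytes in hex
        byte[] bytes = "\u00D3".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xff);  // prints: c3 93
        }
        System.out.println();
    }
}
```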

Now let's check the ISO-8859-1 character set at http://en.wikipedia.org/wiki/ISO/IEC_8859-1 , since this is also a popular character set. However, this is a single-byte character set: every valid character is represented by exactly one byte, unlike UTF-8, where a character may be represented by 1 to 4 bytes.

Note that the character at C3 is Ã, but there is no character assigned at 93. So your default encoding is probably not ISO-8859-1.

Next, let's check Windows-1252 at http://en.wikipedia.org/wiki/Windows-1252 . This is almost the same as ISO-8859-1 but fills in some of the blank spots with useful characters. And there we have a match: the byte sequence C3 93, interpreted as Windows-1252, is exactly the garbled pair Ã“ you are seeing.
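This mis-decoding can be reproduced directly in Java; a small sketch, assuming nothing beyond the standard library:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        // "\u00D3" is Ó; encoding it as UTF-8 gives the bytes C3 93
        byte[] utf8Bytes = "\u00D3".getBytes(StandardCharsets.UTF_8);
        // Decoding those bytes with the wrong charset reproduces the garbling:
        // C3 is Ã and 93 is a left double quotation mark in Windows-1252
        String garbled = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(garbled);  // prints the two-character string Ã“
    }
}
```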

What all this tells me is that your file is UTF-8 encoded, but your Java environment is configured with Windows-1252 as its default encoding. If you modify your code to explicitly specify the character set ("UTF-8") instead of using the default, your code will be less likely to fail on different environments.
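Concretely, swapping the question's FileReader for an InputStreamReader with an explicit charset is a small change; a sketch ("entrada.txt" is a placeholder for the asker's ficheroEntrada):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LectorExplicito {
    public static void main(String[] args) throws IOException {
        String ficheroEntrada = "entrada.txt";  // placeholder path
        // StandardCharsets.UTF_8 (Java 7+) avoids both the checked
        // UnsupportedEncodingException and typos in the charset name
        try (BufferedReader rEntrada = new BufferedReader(new InputStreamReader(
                new FileInputStream(ficheroEntrada), StandardCharsets.UTF_8))) {
            String linea;
            while ((linea = rEntrada.readLine()) != null) {
                System.out.println(linea);
            }
        }
    }
}
```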

Keep in mind, though, that this could just as easily have happened the other way. If you have a file of primarily Spanish text, it could just as easily have been an ISO-8859-1 or Windows-1252 encoded file, in which case your code running on your machine would have worked just fine, and switching it to read "UTF-8" would have produced a different set of garbled characters.
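The reverse mismatch is easy to demonstrate too: a byte that is valid Latin-1 is often not a valid UTF-8 sequence, so decoding it as UTF-8 produces the Unicode replacement character instead. A sketch:

```java
import java.nio.charset.StandardCharsets;

public class ReverseMojibake {
    public static void main(String[] args) {
        // An ISO-8859-1/Windows-1252 file stores Ó as the single byte D3
        byte[] legacy = "\u00D3".getBytes(StandardCharsets.ISO_8859_1);
        // D3 alone is a truncated UTF-8 sequence, so decoding it as UTF-8
        // yields the replacement character U+FFFD
        String garbled = new String(legacy, StandardCharsets.UTF_8);
        System.out.println(garbled);  // prints: �
    }
}
```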

This is part of the reason you are getting conflicting advice. Different people have encountered different mismatches based on their platform and so have discovered different fixes.

When in doubt, I read the file in emacs and switch to hexl-mode so I can see the exact binary data in the file. I'm sure there are better and more modern ways to do this.
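A few lines of standard-library Java can serve the same purpose as hexl-mode; a sketch with a placeholder file name:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    public static void main(String[] args) throws IOException {
        // Dump the raw bytes of the file in hex, 16 per row
        byte[] data = Files.readAllBytes(Paths.get("entrada.txt"));  // placeholder path
        for (int i = 0; i < data.length; i++) {
            System.out.printf("%02x ", data[i] & 0xff);
            if ((i + 1) % 16 == 0) {
                System.out.println();
            }
        }
        System.out.println();
    }
}
```

On the command line, `hexdump -C file` or `xxd file` will do the same job.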

A final thought: it might be worth reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

You have the default encoding wrong. You probably need to read UTF-8 or Latin-1 (ISO-8859-1). See this snippet for setting the encoding on streams. See also Java, default encoding.

public class Program {

    public static void main(String... args) {

        if (args.length != 2) {
            return;
        }

        // try-with-resources closes both streams automatically;
        // closing a BufferedReader/BufferedWriter also closes
        // its underlying stream
        try (BufferedReader fin = new BufferedReader(new InputStreamReader(
                 new FileInputStream(args[0]), "UTF-8"));
             BufferedWriter fout = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(args[1]), "UTF-8"))) {

            String s;
            while ((s = fin.readLine()) != null) {
                fout.write(s);
                fout.newLine();
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In my experience, Spanish text files are often written in the Western encoding ISO-8859-1, so reading with that charset may be what you need:

BufferedReader rEntrada = new BufferedReader(
    new InputStreamReader(new FileInputStream(ficheroEntrada), "ISO-8859-1"));

The other answers point you in the right direction. Just wanted to add that Guava, with its Files.newReader(File, Charset) helper method, makes creating such a BufferedReader a lot more readable (pardon the pun):

BufferedReader rEntrada = Files.newReader(new File(ficheroEntrada), Charsets.UTF_8);
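Worth noting: on Java 7 or later, the JDK itself offers an equally compact helper via java.nio.file.Files, with no extra dependency; a sketch with a placeholder file name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LectorNio {
    public static void main(String[] args) throws IOException {
        // Files.newBufferedReader decodes with the given charset
        try (BufferedReader rEntrada = Files.newBufferedReader(
                Paths.get("entrada.txt"), StandardCharsets.UTF_8)) {  // placeholder path
            String linea;
            while ((linea = rEntrada.readLine()) != null) {
                System.out.println(linea);
            }
        }
    }
}
```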
