
Java's charsets / character encoding

I have a file in Spanish, so it's full of characters like:

 á é í ó ú ñ Ñ Á É Í Ó Ú 

I have to read the file, so I do this:

fr = new FileReader(ficheroEntrada);
BufferedReader rEntrada = new BufferedReader(fr);

String linea = rEntrada.readLine();
if (linea == null) {
    logger.error("ERROR: Empty file.");
    return null;
}
String delimitador = "[;]";
String[] tokens = null;

List<String> token = new ArrayList<String>();
while ((linea = rEntrada.readLine()) != null) {
    // Some parsing specific to my file. 
    tokens = linea.split(delimitador);
    token.add(tokens[0]);
    token.add(tokens[1]);
}
logger.info("List of tokens: " + token);
return token;

When I read the list of tokens, all the special characters are gone and have been replaced by characters like these:

Ó = Ó
Ñ = Ñ

And so on...

What's happening? I had never had problems with charsets (I'm assuming it's a charset issue). Is it because of this computer? What can I do?

Any extra advice will be appreciated, I'm learning! Thank you!

You need to specify the character encoding explicitly. Note that FileInputStream takes the file path, not the FileReader:

BufferedReader rEntrada = new BufferedReader(
    new InputStreamReader(new FileInputStream(ficheroEntrada), "UTF-8"));
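As a rough sketch of how the question's loop might look with an explicit charset, the following uses `Files.newBufferedReader` from `java.nio.file`. The class name `TokenReader`, the method `readTokens`, and the sample data are hypothetical names chosen for illustration; the semicolon delimiter and the skipped first line mirror the question's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class TokenReader {

    // Hypothetical helper: reads a semicolon-delimited file with the charset
    // named by the caller instead of the platform default.
    static List<String> readTokens(Path file, Charset charset) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line = reader.readLine(); // first line is skipped, as in the question
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(";");
                if (parts.length >= 2) { // guard against short lines
                    tokens.add(parts[0]);
                    tokens.add(parts[1]);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Sample data written as UTF-8; \u00F1 is n-tilde, \u00D3 is O-acute.
        Path file = Files.createTempFile("tokens", ".csv");
        Files.write(file, "col1;col2\nEspa\u00F1a;\u00D3scar\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readTokens(file, StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

Passing the charset as a parameter keeps the decision visible at the call site, so the same code can read an ISO-8859-1 file by passing `StandardCharsets.ISO_8859_1` instead.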

What's happening?

The answers recommending reading and writing using UTF-8 encoding should fix your problem. My answer is more about what happened and how to diagnose similar problems in the future.

The first place to start is the UTF-8 character table at http://www.utf8-chartable.de . There is a drop-down on the page which lets you browse different portions of Unicode. One of your problem characters is Ó. Checking the chart reveals that if your file was encoded in UTF-8, then the character is U+00D3 LATIN CAPITAL LETTER O WITH ACUTE and the UTF-8 sequence is two bytes, hex c3 93.
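The chart lookup can also be reproduced in code. This is a minimal sketch; the class `Utf8Bytes` and the method `utf8Hex` are hypothetical names, and the characters are written as Unicode escapes so the source file's own encoding cannot interfere.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {

    // Returns the UTF-8 encoding of a string as space-separated lowercase hex bytes.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00D3")); // O with acute: prints "c3 93"
        System.out.println(utf8Hex("\u00D1")); // N with tilde: prints "c3 91"
    }
}
```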

Now let's check the ISO-8859-1 character set at http://en.wikipedia.org/wiki/ISO/IEC_8859-1 , since this is also a popular character set. However, this is one of those single-byte character sets. Every valid character is represented by a single byte, unlike UTF-8, where a character may be represented by 1 to 4 bytes.

Note that the character at C3 is Ã, but there is no printable character at 93 (that position falls in the C1 control range). So your default encoding is probably not ISO-8859-1.

Next, let's check Windows 1252 at http://en.wikipedia.org/wiki/Windows-1252 . This is almost the same as ISO-8859-1 but fills in some of the blank positions with useful characters. And there we have a match. The sequence C3 93 in Windows 1252 is exactly the character string Ó: C3 is à and 93 is the left double quotation mark “.

What all this tells me is that your file is UTF-8 encoded, but your Java environment is configured with Windows 1252 as its default encoding. If you modify your code to explicitly specify the character set ("UTF-8") instead of using the default, your code will be less likely to fail in different environments.
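The mismatch described here can be reproduced in a few lines. This is only a sketch: `MojibakeDemo` and `misread` are hypothetical names, and it assumes the JDK in use ships the `windows-1252` charset (common, but not one of the charsets every Java platform is required to support). The last line prints the JVM's default charset, which is what `FileReader(String)` falls back on; on JDK 18 and later the default is UTF-8 regardless of the operating system.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes with the wrong charset.
    static String misread(String original, Charset wrong) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // Assumption: this JDK provides windows-1252.
        Charset cp1252 = Charset.forName("windows-1252");
        String garbled = misread("\u00D3", cp1252);

        // Print code points rather than glyphs so the console encoding does not matter.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+00C3, then U+201C
        }

        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```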

Keep in mind, though, that this could just as easily have happened the other way. If you have a file of primarily Spanish text, it could just as easily have been an ISO-8859-1 or Windows 1252 encoded file. In that case your code running on your machine would have worked just fine, and switching it to read "UTF-8" encoding would have created a different set of garbled characters.
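One way to tell the two cases apart before choosing a charset is to run the bytes through a strict UTF-8 decoder. The sketch below assumes the data fits in memory; `Utf8Check` and `isValidUtf8` are hypothetical names. A positive result does not prove the file was meant as UTF-8 (pure ASCII passes too), but a negative result rules UTF-8 out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "\u00D3".getBytes(StandardCharsets.ISO_8859_1); // single byte 0xD3
        byte[] utf8 = "\u00D3".getBytes(StandardCharsets.UTF_8);        // bytes 0xC3 0x93
        System.out.println(isValidUtf8(latin1)); // prints "false"
        System.out.println(isValidUtf8(utf8));   // prints "true"
    }
}
```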

This is part of the reason you are getting conflicting advice. Different people have encountered different mismatches based on their platform and so have discovered different fixes.

When in doubt, I open the file in emacs and switch to hexl-mode so I can see the exact binary data in the file. I'm sure there are better and more modern ways to do this.
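If an editor with a hex mode is not at hand, the raw bytes can also be inspected from Java itself. This is a minimal sketch, not a replacement for a real hex viewer; `HexDump` and `toHex` are hypothetical names, and the 256-byte limit is an arbitrary choice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexDump {

    // Formats up to 'limit' bytes as lowercase hex, 16 bytes per line.
    static String toHex(byte[] data, int limit) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(data.length, limit);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(i % 16 == 0 ? '\n' : ' ');
            }
            sb.append(String.format("%02x", data[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HexDump <file>");
            return;
        }
        // Reads the whole file into memory, so this suits small files only.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(toHex(data, 256));
    }
}
```

In a UTF-8 file, an Ó shows up as the pair `c3 93`; in an ISO-8859-1 or Windows 1252 file it is the single byte `d3`.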

A final thought: it might be worth reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Your default encoding does not match the file. You probably need to read it as UTF-8 or Latin-1 (ISO-8859-1). The snippet below sets the encoding on streams. See also "Java, default encoding".

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class Program {

    public static void main(String... args) {

        // Expects two arguments: the input file path and the output file path.
        if (args.length != 2) {
            return;
        }

        // try-with-resources closes both streams even if an exception is thrown.
        // Closing a BufferedReader/BufferedWriter also closes its underlying stream.
        try (BufferedReader fin = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), "UTF-8"));
             BufferedWriter fout = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), "UTF-8"))) {
            String s;
            while ((s = fin.readLine()) != null) {
                fout.write(s);
                fout.newLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In my experience, the text file should be read and written using a Western encoding: ISO-8859-1.

BufferedReader rEntrada = new BufferedReader(
    new InputStreamReader(new FileInputStream(ficheroEntrada), "ISO-8859-1"));

The other answers point you in the right direction. I just wanted to add that Guava, with its Files.newReader(File,Charset) helper method, makes creating such a BufferedReader a lot more readable (pardon the pun):

BufferedReader rEntrada = Files.newReader(new File(ficheroEntrada), Charsets.UTF_8);
