Reading a file using utf-8 that is encoded in utf-8 doesn't work, but reading the same file using "windows-1252" or "iso-8859-1" does

What is happening here? Why, when I read the file using utf-8, does it output question marks in the console?

This is a minimal working example:

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
    
import static org.apache.commons.io.FileUtils.readFileToString;
import static org.apache.commons.io.FileUtils.writeStringToFile;
    
public class Main {
    
    public static void main(String... args) throws IOException {
    
        System.out.println("---------");
        System.out.println(Charset.defaultCharset());
        System.out.println("æ ø å");
        System.out.println("æ ø å");
        System.out.println("æ ø å");
    
        File inputFile  = new File(System.getProperty("user.dir") + "/input.md");
        File outputFile = new File(System.getProperty("user.dir") + "/output.md");
    
        String content, encoding;
    
        System.out.println("--------- windows-1252");
        encoding = "windows-1252";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- iso-8859-1");
        encoding = "iso-8859-1";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        System.out.println("--------- utf-8");
        encoding = "utf-8";
        content = readFileToString(inputFile, encoding);
        System.out.println(content);
    
    
        writeStringToFile(outputFile, content, encoding);
    
    }
    
}

Where input.md (encoded in UTF-8) contains:

This is input.md. 'æ' 'ø' 'å'

Running the above code yields:

---------
windows-1252
æ ø å
æ ø å
æ ø å
--------- windows-1252
This is file C. 'æ' 'ø' 'å'.
--------- iso-8859-1
This is file C. 'æ' 'ø' 'å'.
--------- utf-8
This is file C. '�' '�' '�'.

Why do I get � when I read the file using UTF-8? This is especially weird since the file is encoded in UTF-8.

UPDATE: My console is set to "UTF-8":

[image]

Here is a screenshot of the hex values of each char in the string extracted from the input file:

[image]

Here is a better screenshot with the hex values isolated:

[image]

The code looks fine to me, and your output.md file looks OK. So this is most likely just an issue with the console output.

The Unicode characters you are experimenting with are encoded as the same single bytes in both Windows-1252 and ISO-8859-1 (æ = 0xE6, ø = 0xF8, å = 0xE5), but are encoded as multiple bytes in UTF-8 (æ = 0xC3 0xA6, ø = 0xC3 0xB8, å = 0xC3 0xA5).
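The byte values above can be checked directly. The following is a minimal sketch, not part of the original answer; the class name `EncodeDemo` and the helper `hex` are made up for illustration, and it assumes the JVM provides the `windows-1252` charset. It prints only ASCII, so the result does not depend on how the console is configured.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {

    // Hypothetical helper: formats a byte array as space-separated hex, e.g. "0xC3 0xA6".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source independent of the encoding of the .java file.
        String[] letters = { "\u00E6", "\u00F8", "\u00E5" }; // æ, ø, å
        Charset[] charsets = {
            Charset.forName("windows-1252"), // assumed to be available in this JVM
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            for (String s : letters) {
                String codePoint = String.format("U+%04X", (int) s.charAt(0));
                System.out.println(cs.name() + " " + codePoint + " -> " + hex(s.getBytes(cs)));
            }
        }
    }
}
```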

Reading a UTF-8 encoded file as either Windows-1252 or ISO-8859-1 will decode each byte individually, producing a string containing a separate char for each byte, and for the bytes involved here those chars will have the same numeric values as the bytes. So, you should be ending up with a string containing chars 0x00C3 0x00A6, 0x00C3 0x00B8, and 0x00C3 0x00A5. Outputting those chars to the console as Windows-1252 should show Ã¦ Ã¸ Ã¥, not æ ø å.

On the other hand, reading a UTF-8 encoded file as UTF-8 will decode the file properly, producing a string with chars 0x00E6, 0x00F8, and 0x00E5. Writing that string to a UTF-8 encoded file should produce the correct byte sequences (0xC3 0xA6, 0xC3 0xB8, and 0xC3 0xA5). Outputting that same string as Windows-1252 risks data loss in general, but here you should be seeing æ ø å as expected, since Windows-1252 does support those Unicode characters.
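To make the two decoding paths concrete, here is a small sketch (the class `DecodeDemo` and helper `codeUnits` are illustrative names, not from the question). It decodes the UTF-8 bytes of æ once as Windows-1252 and once as UTF-8 and prints the resulting char values as hex rather than as characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Hypothetical helper: renders each UTF-16 code unit of s as hex.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("0x%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00E6".getBytes(StandardCharsets.UTF_8); // the two bytes 0xC3 0xA6

        // Decoding byte by byte yields two chars, one per byte.
        String asCp1252 = new String(utf8, Charset.forName("windows-1252"));
        // Decoding as UTF-8 combines both bytes into the single char for æ.
        String asUtf8 = new String(utf8, StandardCharsets.UTF_8);

        System.out.println("as windows-1252: " + codeUnits(asCp1252)); // 0x00C3 0x00A6
        System.out.println("as utf-8:        " + codeUnits(asUtf8));   // 0x00E6
    }
}
```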

So, your results are actually backwards from what I would expect. Even though Charset.defaultCharset() is reporting Windows-1252, I suspect your console is actually using a different charset for its output.
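One reading that would fit the output in the question, assuming the console really decodes incoming bytes as UTF-8 as the update says: System.out encodes with the JVM default (Windows-1252), so a correctly decoded æ leaves the JVM as the single byte 0xE6, which is not valid UTF-8 on its own and is shown as �, while the mis-decoded pair 0x00C3 0x00A6 leaves as the bytes 0xC3 0xA6, which the console renders as æ. The sketch below (the class `ConsoleEncodingDemo` and helper `printedBytes` are hypothetical names) shows the bytes a PrintStream emits under each charset, and how wrapping System.out pins the output charset from inside the program.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingDemo {

    // Hypothetical helper: returns the bytes a PrintStream using the named charset emits for s.
    static byte[] printedBytes(String s, String charsetName) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(s);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\u00E6"; // æ

        System.out.println("windows-1252 stream emits "
                + printedBytes(s, "windows-1252").length + " byte(s)"); // 1
        System.out.println("utf-8 stream emits "
                + printedBytes(s, "UTF-8").length + " byte(s)");        // 2

        // Wrapping System.out fixes the charset used for output, whatever the JVM default is.
        // Whether this displays correctly still depends on the console decoding as UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(s);
    }
}
```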

I suggest you print out the numeric values of the individual chars of the content string to see exactly how input.md is actually being decoded by each encoding. You should be getting the char values I mentioned above.
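A minimal way to do that check might look like the sketch below. It is an illustration only: the class `CharDump` and method `nonAsciiChars` are invented names, and it reads the file with `java.nio.file.Files` instead of the Commons IO helpers from the question so that it has no external dependency.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CharDump {

    // Hypothetical helper: decodes the file with the given charset and
    // lists every char above 0x7F as hex, in order of appearance.
    static String nonAsciiChars(Path file, String charsetName) throws IOException {
        String content = new String(Files.readAllBytes(file), Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c > 0x7F) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(String.format("0x%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Same location as in the question: input.md in the working directory.
        Path input = Paths.get(System.getProperty("user.dir"), "input.md");
        if (!Files.exists(input)) {
            System.out.println("No input.md found in " + input.getParent());
            return;
        }
        for (String name : new String[] { "windows-1252", "iso-8859-1", "utf-8" }) {
            System.out.println(name + ": " + nonAsciiChars(input, name));
        }
    }
}
```

Because only hex digits are printed, the result reads the same regardless of the console's charset.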

For people with similar issues, the problem lies with the encoding of the console (as @Remy Lebeau points out too).

I fixed the issue by following this answer.

Actually, I followed @Nicolas's comment on the mentioned answer:

This is also accessible from Help > Edit custom VM options... then restart IntelliJ. I literally tried everything: changing encoding settings everywhere in IntelliJ, changing JVM options set by properties file, build.gradle file, IntelliJ, run configuration, environment variable, etc. Also tried changing system-wide encoding. Nothing worked but this.

Now I get the expected output:

[image]
