Converted Word document (from Windows-1252 to UTF-8) does not display characters correctly

Question

I have a Windows-1252 Word document that I want to convert to UTF-8. I need to do this to correctly convert the doc file to a PDF. This is how I currently do it:

 Path source = Paths.get("source.doc");
 Path temp = Paths.get("temp.doc");
 try (BufferedReader sourceReader = new BufferedReader(new InputStreamReader(new FileInputStream(source.toFile()), "windows-1252"));
      BufferedWriter tempWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(temp.toFile()), "UTF-8"))) {
     String line;
     while ((line = sourceReader.readLine()) != null) {
         tempWriter.write(line);
     }
 }

However, when I open the converted file (temp.doc) in Word, it doesn't display some characters correctly. For example, Ü becomes Ã¼.

How can I solve this? When I create a new BufferedReader (with UTF-8 encoding) and read temp, the characters are shown correctly in the console of my IDE.

Answer 1

"I have a Windows-1252 Word document"

That's not a text file. Word documents are basically binary data: open one in a plain text editor and you'll see all kinds of gibberish. You may see some text in there as well, but it's not a plain text file, which is how you're trying to read it.
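One quick way to check what kind of file you actually have is to look at its first few bytes. The sketch below is only a heuristic, not a full format validator: it assumes that a legacy binary .doc starts with the OLE2 compound file signature (D0 CF 11 E0 A1 B1 1A E1) and that a .docx, being a ZIP archive, starts with "PK". The class name DocSniffer and its methods are made up for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class DocSniffer {
    // Signature at the start of an OLE2 compound file (container of legacy .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Classifies a file by its leading bytes (heuristic only). */
    static String classify(byte[] head) {
        if (head.length >= 8 && Arrays.equals(Arrays.copyOf(head, 8), OLE2_MAGIC)) {
            return "OLE2 binary (legacy .doc)";
        }
        if (head.length >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "ZIP container (possibly .docx)";
        }
        return "unknown (possibly plain text)";
    }

    /** Reads the first 8 bytes of the file and classifies them. */
    static String classifyFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return classify(in.readNBytes(8)); // readNBytes(int) requires Java 11+
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java DocSniffer.java <file>");
            return;
        }
        System.out.println(classifyFile(Path.of(args[0])));
    }
}
```

If this reports an OLE2 or ZIP container, decoding the file as Windows-1252 text and re-encoding it as UTF-8 will damage it rather than convert it.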

It's not even clear to me what a "Windows-1252 Word document" means... Word will use whatever encoding it wants internally, and I'm not sure there's any control over that. I would expect any decent "doc to PDF" converter to handle any valid Word document.

"When I create a new BufferedReader (with UTF-8 encoding) and I read temp, the characters are shown correctly in the console of my IDE."

If that's the case, that suggests it is a plain text file to start with, not a Word document. You need to be very clear in your own mind exactly what you've got: a Word document or a plain text file. They're not the same thing, and shouldn't be treated the same way.
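If it does turn out to be plain text, a transcoding step could look roughly like the sketch below. It assumes the input really is Windows-1252 text; the class name TextTranscoder and the method transcode are invented for this example. Copying characters with Reader.transferTo (Java 10+) keeps line terminators intact, whereas a readLine() loop that writes only the returned string drops them.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextTranscoder {
    /** Decodes source using charset 'from' and writes the same characters to target using 'to'. */
    static void transcode(Path source, Charset from, Path target, Charset to) throws IOException {
        try (Reader in = Files.newBufferedReader(source, from);
             Writer out = Files.newBufferedWriter(target, to)) {
            in.transferTo(out); // copies all characters, including \r and \n
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java TextTranscoder.java <source.txt> <target.txt>");
            return;
        }
        transcode(Path.of(args[0]), Charset.forName("windows-1252"),
                  Path.of(args[1]), StandardCharsets.UTF_8);
    }
}
```

This is only meaningful for text files. Running it on a real .doc would rewrite the binary container byte by byte and most likely corrupt it.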

Answer 2

To force Excel to use UTF-8 encoding, I used the byte order mark (before any other content):

tempWriter.write('\uFEFF'); // byte order mark to let Excel know to expect UTF-8
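To see what that call puts into the file, here is a small sketch that writes the mark through a UTF-8 writer into memory and prints the resulting bytes. The class name BomDemo and the method withBom are invented for this illustration, and it only applies to plain text output such as CSV.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class BomDemo {
    /** Encodes a byte order mark followed by the given text as UTF-8. */
    static byte[] withBom(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write('\uFEFF'); // byte order mark, must come before any other content
            w.write(text);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (byte b : withBom("a")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: EF BB BF 61
    }
}
```

The three leading bytes EF BB BF are the UTF-8 encoding of U+FEFF, which some applications use as a hint that the text that follows is UTF-8.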

Hope this helps!

Answer 1 (score 2, posted 2014-05-21 09:46:49)

Answer 2 (score -1, posted 2014-05-21 09:49:59)
