Converted Word document (from Windows-1252 to UTF-8) not displaying characters correctly

Question

I have a Windows-1252 Word document that I want to convert to UTF-8. I need to do this so that the .doc file converts to PDF correctly. This is how I currently do it:

 Path source = Paths.get("source.doc");
 Path temp = Paths.get("temp.doc");
 try (BufferedReader sourceReader = new BufferedReader(new InputStreamReader(new FileInputStream(source.toFile()), "windows-1252"));
      BufferedWriter tempWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(temp.toFile()), "UTF-8"))) {
     String line;
     while ((line = sourceReader.readLine()) != null) {
         tempWriter.write(line);
         tempWriter.newLine(); // readLine() strips the line terminator, so write one back
     }
 }

However, when I open the converted file (temp.doc) in Word, some characters are not displayed correctly. For example, ü becomes Ã¼.

How can I solve this? When I create a new BufferedReader (with UTF-8 encoding) and read temp.doc, the characters are shown correctly in the console of my IDE.

Answer 1

> I have a Windows-1252 Word document

That's not a text file. Word documents are basically binary data: open one in a plain text editor and you'll see all kinds of gibberish. You may see some text in there as well, but it's not a plain text file, which is how you're trying to read it.
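To illustrate why reading binary data as text is destructive, here is a minimal sketch that pushes eight bytes through the same decode-as-Windows-1252, encode-as-UTF-8 path as the code in the question. The class and method names are made up for the example, and the sample bytes are the signature that legacy binary .doc files (OLE2 compound files) start with; whether your file actually starts with them is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BinaryRoundTrip {
    // Decodes raw bytes as Windows-1252 text and re-encodes that text as UTF-8,
    // which is what the reader/writer pair in the question does to each byte.
    public static byte[] transcode(byte[] raw) {
        String asText = new String(raw, Charset.forName("windows-1252"));
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Signature found at the start of a legacy binary .doc (OLE2 compound file).
        byte[] header = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                         (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        byte[] converted = transcode(header);
        // Every byte >= 0x80 turns into two UTF-8 bytes, so the signature is destroyed.
        System.out.println(header.length + " bytes in, " + converted.length + " bytes out");
        // prints "8 bytes in, 14 bytes out"
    }
}
```

Six of the eight bytes are above 0x7F, so each of them grows to two bytes and the file no longer begins with a signature that Word or a converter would recognise.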

It's not even clear to me what a "Windows-1252 word document" means... Word will use whatever encoding it wants internally, and I'm not sure there's any control over that. I would expect any decent "doc to PDF" converter to handle any valid Word document.

> When I create a new BufferedReader (with UTF-8 encoding) and I read temp, the characters are shown correctly in the console of my IDE.

If that's the case, that suggests it is a plain text file to start with, not a Word document. You need to be very clear in your own mind exactly what you've got - a Word document, or a plain text file. They're not the same thing, and shouldn't be treated the same way.
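One way to find out what you've got is to look at the first few bytes of the file. The sketch below is only a rough heuristic: the class name, method names and result strings are invented for the example, and the absence of a known signature does not prove that a file is plain text. It checks for the OLE2 signature used by legacy binary .doc files and for the ZIP signature used by .docx containers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileKind {
    // Signature of an OLE2 compound file (legacy binary .doc, .xls, ...).
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // Signature of a ZIP local file header ("PK\3\4"); .docx files are ZIP containers.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    public static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Returns a rough guess at the file kind based on its leading bytes.
    public static String classify(byte[] data) {
        if (startsWith(data, OLE2)) return "OLE2 binary (legacy .doc?)";
        if (startsWith(data, ZIP)) return "ZIP container (.docx?)";
        return "no known binary signature (plain text?)";
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Reads the whole file for simplicity; only the first bytes are inspected.
            System.out.println(classify(Files.readAllBytes(Paths.get(args[0]))));
        } else {
            System.out.println(classify("Hello".getBytes("US-ASCII")));
        }
    }
}
```

If a file with a .doc extension reports no binary signature, it is probably text that was merely named .doc, and a charset conversion like the one in the question is at least meaningful for it.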

Answer 2

To force Excel to use UTF-8 encoding, I used the byte order mark (before any other content):

tempWriter.write('\uFEFF'); // byte order mark to let Excel know to expect UTF-8
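For a plain text file, a small sketch of what the byte order mark changes on disk, and of what a reader sees if it assumes Windows-1252 instead of UTF-8. The class and method names are made up for the example, and whether a particular application honours the mark is an assumption you would have to test.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Encodes text as UTF-8, optionally preceded by the byte order mark U+FEFF.
    public static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            if (withBom) {
                writer.write('\uFEFF'); // encoded as the three bytes EF BB BF
            }
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = encode("\u00FC", false);  // ü alone: C3 BC
        byte[] marked = encode("\u00FC", true);  // with mark: EF BB BF C3 BC
        System.out.println(plain.length + " vs " + marked.length); // prints "2 vs 5"

        // A reader that assumes Windows-1252 sees the two UTF-8 bytes as two characters.
        String misread = new String(plain, Charset.forName("windows-1252"));
        System.out.println(misread.equals("\u00C3\u00BC")); // prints "true" (the string is "Ã¼")
    }
}
```

The second part reproduces the symptom from the question: the file itself is valid UTF-8, but it is being interpreted as Windows-1252.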

Hope this helps!

solution1
2 2014-05-21 09:46:49

solution2
-1 2014-05-21 09:49:59
