Character Encoding Trouble - Java

Question

I've written a little application that does some text manipulation and writes the output to a file (HTML, CSV, DOCX, XML). This all appears to work fine on Mac OS X. On Windows, however, I get character encoding problems: a lot of '"' characters disappear and are replaced with some weird stuff, usually the closing '"' of a pair.

I use FreeMarker to create my output files, and there is a byte[] array, and in one case also a ByteArrayStream, between reading the templates and writing the output. I assume this is a character encoding problem, so could someone give me advice or point me to some 'best practice' resource for dealing with character encoding in Java?

Thanks

Answer 1

There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.

Typical problematic spots in the Java API are:

new String(byte[])
String.getBytes()
FileReader, FileWriter

All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases. FileReader and FileWriter unfortunately don't allow that before Java 11, so there you have to wrap a FileInputStream/FileOutputStream in an InputStreamReader/OutputStreamWriter instead.
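As a minimal sketch of what "explicitly declare an encoding" can look like, the class below writes and reads a temporary file as UTF-8 and does the same for the String/byte[] conversions. The class and method names (ExplicitEncodingDemo, writeUtf8, readUtf8) are made up for illustration, and it assumes Java 7 or later for StandardCharsets and try-with-resources.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
class ExplicitEncodingDemo {

    // Writes text as UTF-8, regardless of the platform default encoding.
    static void writeUtf8(File file, String text) throws IOException {
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            w.write(text);
        }
    }

    // Reads the whole file back, decoding the bytes as UTF-8.
    static String readUtf8(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("encoding-demo", ".txt");
        tmp.deleteOnExit();

        String text = "\u201Cquoted\u201D"; // text wrapped in curly quotes
        writeUtf8(tmp, text);
        System.out.println(text.equals(readUtf8(tmp)));

        // The same rule for in-memory conversions: always name the charset.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(text.equals(decoded));
    }
}
```

Both lines should print true on any platform, because no step falls back to the default encoding.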

However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like it's one that inserts "smart quotes", which are part of the Windows-specific cp1252 encoding but don't exist in the more global ISO-8859-1 encoding.
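To see the smart-quote mismatch for yourself, something like the following sketch may help. It checks whether the closing curly quote (U+201D) can be encoded in each charset. SmartQuoteDemo and canEncode are illustrative names, and the code assumes the JDK ships the windows-1252 charset, which is common but not guaranteed by the specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class SmartQuoteDemo {

    // Assumption: this charset is available in the running JDK.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // True if every character of text is representable in the given charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String closingQuote = "\u201D"; // RIGHT DOUBLE QUOTATION MARK

        System.out.println(canEncode(closingQuote, CP1252));                      // true
        System.out.println(canEncode(closingQuote, StandardCharsets.ISO_8859_1)); // false

        // In cp1252 the closing quote is the single byte 0x94.
        byte[] cp1252Bytes = closingQuote.getBytes(CP1252);
        System.out.println(Integer.toHexString(cp1252Bytes[0] & 0xFF));           // 94

        // getBytes silently substitutes '?' for characters the charset lacks.
        byte[] latin1Bytes = closingQuote.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1Bytes, StandardCharsets.ISO_8859_1)); // ?
    }
}
```

The silent substitution in the last step is what makes this kind of bug hard to spot: no exception is thrown, the character just turns into something else.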

What you probably need to do is to be aware of which encoding your templates are saved in, and configure your template engine to use that encoding when reading in the templates. Also be aware that some text files, specifically XML, explicitly declare the encoding in a header, and if that header disagrees with the actual encoding used by the file, you'll invariably run into problems.

Answer 2

You can control which encoding your JVM will run with by supplying, for example,

-Dfile.encoding=utf-8

(for UTF-8, of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:

java -Dfile.encoding=utf-8 my.MainClass
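To check what the JVM actually ended up with, a small sketch like this prints both the system property and the effective default charset. DefaultCharsetDemo and describeDefault are illustrative names. The output depends on the platform, the JDK version, and any -Dfile.encoding argument, so no particular result is shown here.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, for illustration only.
class DefaultCharsetDemo {

    // Reports the file.encoding property and the charset the JVM uses by default.
    static String describeDefault() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefault());
    }
}
```

Running it once with and once without the -Dfile.encoding argument on each target machine is a quick way to confirm whether the platforms really differ.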

Answer 3

Running the JVM with a 'standard' encoding via the confusingly named -Dfile.encoding will resolve a lot of problems.

Ensuring your app doesn't make use of byte[] <-> String conversions without an encoding specified is important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications).

If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.

Answer 4

I had to make sure that the OutputStreamWriter uses the correct encoding:

OutputStream out = ...
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
template.process(model, writer);

Plus, if you use a ByteArrayOutputStream, also make sure to call toString with the correct encoding:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
...
baos.toString("UTF-8");
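Put together without FreeMarker, the two snippets above might look like the following self-contained sketch. The writer.write call is only a stand-in for template.process(model, writer), and BaosEncodingDemo and render are made-up names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo class; the template engine is replaced by a plain write.
class BaosEncodingDemo {

    // Encodes text to UTF-8 bytes through a Writer, then decodes it again as UTF-8.
    static String render(String text) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(baos, "UTF-8");
        writer.write(text); // stand-in for template.process(model, writer)
        writer.flush();     // push buffered characters into baos before reading it
        return baos.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        String text = "She said \u201Chello\u201D";
        System.out.println(text.equals(render(text)));
    }
}
```

This should print true. The point is that the charset used when writing and the charset passed to toString must match. Calling toString() with no argument would decode with the platform default instead.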

4 answers

solution1
5 ACCEPTED 2009-04-07 10:31:24

solution2
3 2009-04-07 10:30:38

solution3
1 2009-04-07 10:44:46

solution4
0 2011-10-28 14:44:16
