
Encoding issues

I have a "windows-1255"-encoded String; is there any safe way I can convert it to a "UTF-8" String, and vice versa?

In general, is there a safe way (meaning data will not be damaged) to convert between encodings in Java?

     byte[] bytes = str.getBytes("UTF-8");
     String decoded = new String(bytes, "UTF-8");

If the original data is not encoded as "UTF-8", can the data be damaged?

You can't have a String object in Java properly encoded as anything other than UTF-16, as that's the sole encoding for those objects defined by the spec. Of course, you could do something untoward like put windows-1252 values in a char[] and create a String from it, but things will go wrong pretty much immediately.

What you can have is byte[] encoded in various different ways, and you can convert them to and from String using constructors which take a Charset, and with getBytes as in your code.

So you can do conversions using a String as an intermediate. I don't know of any way in the JDK to do a direct conversion, but the intermediate is likely not too costly in practice.
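The conversion through a String intermediate described above can be sketched as follows. The class and method names here are illustrative, not from the original post:

```java
import java.nio.charset.Charset;

public class CharsetConvert {
    // Decode bytes from one charset and re-encode in another,
    // using a String as the intermediate representation.
    public static byte[] convert(byte[] input, String from, String to) {
        return new String(input, Charset.forName(from)).getBytes(Charset.forName(to));
    }

    public static void main(String[] args) {
        // 0xE0 is HEBREW LETTER ALEF (U+05D0) in windows-1255;
        // in UTF-8 it becomes the two bytes 0xD7 0x90.
        byte[] utf8 = convert(new byte[] { (byte) 0xE0 }, "windows-1255", "UTF-8");
        System.out.println(utf8.length); // 2
    }
}
```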

About round-trip conversions: it is not generally true that you can convert between encodings without losing data. Only a few encodings can handle the full spectrum of Unicode characters (eg the UTF family, GB18030, etc), while many legacy character sets encode only a small subset. You can't safely round-trip through those character sets without losing data, unless you are sure the input falls into the representable set.
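A small sketch of that lossiness: `String.getBytes` silently substitutes `?` for characters the target charset cannot represent, so a round trip through a legacy charset only survives when the input stays inside its repertoire.

```java
import java.nio.charset.Charset;

public class LossyRoundTrip {
    // Round-trip a string through the given (possibly legacy) charset.
    public static String roundTrip(String s, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // ASCII is inside windows-1255's repertoire, so it survives.
        System.out.println(roundTrip("shalom", "windows-1255")); // shalom
        // U+4E2D has no mapping in windows-1255; the encoder substitutes '?'.
        System.out.println(roundTrip("\u4e2d", "windows-1255")); // ?
    }
}
```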

A String is meant to be a sequence of abstract characters; from the point of view of its users, it does not have any encoding. Of course, it must have an internal encoding, but that's an implementation detail.

It makes no sense to encode a String as UTF-8 and then decode the result back as UTF-8. It will be a no-op, in that:

new String(str.getBytes("UTF-8"), "UTF-8").equals(str) == true;

But there are cases where the String abstraction falls apart and the above will be a "lossy" conversion. Because of the internal implementation details, a String can contain unpaired UTF-16 surrogates which cannot be represented in UTF-8 (or any encoding for that matter, including the internal UTF-16 encoding * ). So they will be lost in the encoding, and when you decode back, you get the original string without the invalid unpaired surrogates.
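The unpaired-surrogate case can be demonstrated directly. Java's UTF-8 encoder substitutes `?` for a lone surrogate, so the decoded result no longer equals the original:

```java
import java.nio.charset.StandardCharsets;

public class SurrogateLoss {
    // Encode to UTF-8 and decode back; lossless only for well-formed strings.
    public static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wellFormed = "abc\u05D0"; // valid UTF-16 throughout
        String broken = "abc\uD800";     // ends with an unpaired high surrogate
        System.out.println(utf8RoundTrip(wellFormed).equals(wellFormed)); // true
        // The lone surrogate cannot be represented in UTF-8, so it is
        // replaced with '?' during encoding and the round trip is lossy.
        System.out.println(utf8RoundTrip(broken).equals(broken));         // false
    }
}
```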

The only thing I can take from your question is that you have a String result from interpreting binary data as Windows-1255, where it should have been interpreted in UTF-8. To fix this, you would have to go to the source of this and use UTF-8 decoding explicitly.

If, however, you only have the string resulting from the misinterpretation, you can't really do anything, as many bytes have no representation in Windows-1255 and would not have made it into the string.

If this wasn't the case, you could fully restore the original intended message by:

new String( str.getBytes("Windows-1255"), "UTF-8");
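As a sketch of that repair, under the assumption that every UTF-8 byte of the original text happens to land on a defined windows-1255 code point (bytes falling on undefined code points would already have been lost, as noted above). The class and method names are illustrative:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    static final Charset WINDOWS_1255 = Charset.forName("windows-1255");

    // Undo a windows-1255 misinterpretation of bytes that were really UTF-8.
    public static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(WINDOWS_1255), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "café", intended to be read as UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Simulate the bug: the UTF-8 bytes were decoded as windows-1255.
        String misdecoded = new String(utf8, WINDOWS_1255);
        // Both UTF-8 bytes of 'é' (0xC3 0xA9) map to defined windows-1255
        // characters, so the repair recovers the original string.
        System.out.println(repair(misdecoded).equals(original)); // true
    }
}
```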

* It is actually wrong of Java to allow unpaired surrogates to exist in its Strings in the first place, since that's not valid UTF-16.
