
String encoding (UTF-8) JAVA

Could anyone please help me out here? I want to know the difference between the two statements below. I am trying to encode a string as UTF-8. Which one is the correct method?

String string2 = new String(string1.getBytes("UTF-8"), "UTF-8");

OR

String string3 = new String(string1.getBytes(), "UTF-8");

Also, if I use the two statements together, i.e.

line 1 : string1 = new String(string1.getBytes("UTF-8"), "UTF-8");
line 2 : string1 = new String(string1.getBytes(), "UTF-8");

will the value of string1 be the same after both lines?

PS: The purpose of all this is to send Japanese text in a web service call, so I want to send it with UTF-8 encoding.

According to the javadoc of String#getBytes(String charsetName) :

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

And the documentation of String(byte[] bytes, Charset charset):

Constructs a new String by decoding the specified array of bytes using the specified charset.

Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string into bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both methods to preserve the actual string value, i.e. your second example is wrong:

// This is wrong: getBytes() encodes with the platform default charset,
// but the bytes are then decoded as UTF-8. It only works when the default
// charset happens to be UTF-8 (common on Linux and macOS, and the default
// from Java 18 onward); otherwise it can corrupt non-ASCII text.
new String(string1.getBytes(), "UTF-8");
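To make the failure concrete, here is a minimal sketch of what happens when the encoding and decoding charsets differ. The class and method names are made up for illustration, and ISO_8859_1 is only an assumed stand-in for a platform default that is not UTF-8. The text is written with Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class CharsetMismatchDemo {

    // Encode and decode with the same charset: the original text comes back.
    static String sameCharset(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode with one charset and decode with another. ISO_8859_1 stands in
    // for a non-UTF-8 platform default (an assumption made for this demo).
    static String mismatchedCharsets(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E"; // the Japanese word for "Japanese language"
        System.out.println(original.equals(sameCharset(original)));        // true
        System.out.println(original.equals(mismatchedCharsets(original))); // false
    }
}
```

In the mismatched case the Japanese characters cannot be represented in ISO-8859-1, so they are replaced during encoding and cannot be recovered by decoding.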

String and char (a two-byte UTF-16 code unit) in Java are for (Unicode) text.

When converting to and from byte[], one needs the Charset (encoding) of those bytes.

Both String.getBytes() and new String(byte[]) are shortcuts that use the platform default encoding. That is almost always wrong for cross-platform use.

So use

byte[] b = s.getBytes("UTF-8");
s = new String(b, "UTF-8");

Or better, without the checked UnsupportedEncodingException that the String-named overloads declare:

byte[] b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);

(StandardCharsets requires Java 7 or later; on Android it is only available from API level 19.)

The same holds for InputStreamReader and OutputStreamWriter, which bridge binary data (InputStream/OutputStream) and text (Reader/Writer): pass the charset explicitly instead of relying on the default.
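As a rough sketch of the stream case, the following writes text through an OutputStreamWriter and reads it back through an InputStreamReader, naming UTF-8 on both sides. The class and method names are hypothetical, and in-memory streams are used only to keep the example self-contained; a file or socket stream would be wrapped the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class StreamCharsetDemo {

    // Text -> bytes: the writer encodes characters as UTF-8.
    static byte[] writeAsUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Bytes -> text: the reader decodes the bytes as UTF-8.
    static String readAsUtf8(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E";
        byte[] bytes = writeAsUtf8(original);
        System.out.println(bytes.length);                        // 9 (three bytes per character)
        System.out.println(original.equals(readAsUtf8(bytes)));  // true
    }
}
```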

Please don't confuse yourself. "String" is usually used to refer to values in a datatype that stores text. In this case, java.lang.String.

Serialized text is a sequence of bytes created by applying a character encoding to a string. In this case, byte[].

There are no UTF-8-encoded strings in Java.
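One way to see the distinction is to serialize the same String with two different charsets: the String is unchanged, and only the resulting byte[] differs. This is an illustrative sketch with hypothetical class and helper names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class EncodedFormsDemo {

    // Render a byte array as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u65E5"; // a single character, code point U+65E5
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));    // E6 97 A5
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_16BE))); // 65 E5
    }
}
```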

If your web service client library takes a string, pass it the string. If it lets you specify an encoding to use for serialization, pass it StandardCharsets.UTF_8 or equivalent.

If it doesn't take a string, then pass it string1.getBytes(StandardCharsets.UTF_8) and use whatever other mechanism it provides to let you tell the recipient that the bytes are UTF-8-encoded text. Or, get a different client library.
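The question does not say which client library is in use, so the following is only a sketch under assumptions: a plain HTTP endpoint, the java.net.http API from Java 11, and a made-up URL. It shows one common mechanism for telling the recipient about the encoding, namely a charset parameter on the Content-Type header. The request is only built here, never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the endpoint and class name are invented for illustration.
final class Utf8RequestSketch {

    // Placeholder address, not a real service.
    static final URI ENDPOINT = URI.create("https://example.invalid/messages");

    static HttpRequest buildRequest(String text) {
        // Serialize the text explicitly as UTF-8 ...
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(ENDPOINT)
                // ... and declare that encoding to the recipient.
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("\u65E5\u672C\u8A9E");
        System.out.println(request.method());
        System.out.println(request.headers().firstValue("Content-Type").orElse("(none)"));
    }
}
```

For a SOAP or other XML-based service, the equivalent declaration usually lives in the XML prolog and the transport headers, which the client library normally handles itself.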

