Java: String to byte array conversion

I am getting some unexpected results from what I thought was a simple test. After running the following:

byte [] bytes = {(byte)0x40, (byte)0xE2, (byte)0x56, (byte)0xFF, (byte)0xAD, (byte)0xDC};
String s = new String(bytes, Charset.forName("UTF-8"));
byte[] bytes2 = s.getBytes(Charset.forName("UTF-8"));

bytes2 is a 14-element array that looks nothing like the original (bytes). Is there a way to do this sort of conversion and retain the original decomposition to bytes?

Well, that doesn't look like valid UTF-8 to me, so I'm not surprised it didn't round-trip.
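To make the point above concrete, here is a minimal sketch of how one might check the bytes from the question. The class and method names (Utf8RoundTripCheck, isValidUtf8, lenientRoundTrip) are made up for illustration. It relies on the documented behaviour that new String(bytes, charset) silently substitutes the replacement character U+FFFD for malformed input, while a CharsetDecoder set to CodingErrorAction.REPORT throws instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8RoundTripCheck {

    // Strict check: returns true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient path used in the question: malformed bytes become U+FFFD,
    // and each U+FFFD is re-encoded as the three bytes EF BF BD.
    static byte[] lenientRoundTrip(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x40, (byte) 0xE2, (byte) 0x56, (byte) 0xFF, (byte) 0xAD, (byte) 0xDC};
        System.out.println("valid UTF-8: " + isValidUtf8(bytes));
        System.out.println("round-trip length: " + lenientRoundTrip(bytes).length);
    }
}
```

With the question's input, four of the six bytes are malformed and two ('@' and 'V') are plain ASCII, so the re-encoded array should come to 4 × 3 + 2 = 14 bytes, which matches what the question reports.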

If you want to convert arbitrary binary data to text in a reversible way, use base64, e.g. via this public domain encoder/decoder.
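As a sketch of that suggestion: the answer points to a third-party encoder/decoder, but on Java 8 and later the standard library's java.util.Base64 can be swapped in for the same purpose. The class and method names below (Base64RoundTrip, toText, fromText) are hypothetical and only for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {

    // Encodes arbitrary bytes as an ASCII-only Base64 string.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recovers the original bytes from the Base64 string.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x40, (byte) 0xE2, (byte) 0x56, (byte) 0xFF, (byte) 0xAD, (byte) 0xDC};
        String s = toText(bytes);
        byte[] bytes2 = fromText(s);
        System.out.println(s);
        System.out.println("same bytes: " + Arrays.equals(bytes, bytes2));
    }
}
```

Unlike the UTF-8 decode in the question, every byte value is representable here, so the decoded array is identical to the input. The cost is size: the text is about a third longer than the raw data.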

This should do:

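import java.nio.charset.StandardCharsets;

public class Main
{

    /*
     * This method converts a String to an array of bytes
     * and prints the length of the resulting array.
     */
    public void convertStringToByteArray()
    {

        String stringToConvert = "This String is 76 characters long and will be converted to an array of bytes";

        // Name the charset explicitly; the no-argument getBytes() uses the platform default.
        byte[] theByteArray = stringToConvert.getBytes(StandardCharsets.UTF_8);

        // The string is pure ASCII, so UTF-8 yields one byte per character.
        System.out.println(theByteArray.length);

    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args)
    {
        new Main().convertStringToByteArray();
    }
}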

Two things:

  1. The byte sequence does not appear to be valid UTF-8:

      $ python
      >>> '\x40\xe2\x56\xff\xad\xdc'.decode('utf8')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 1: invalid continuation byte

  2. Even if it were valid UTF-8, decoding and then encoding can result in different bytes if the text is normalized along the way (for example, precomposed versus decomposed characters) or altered by other Unicode processing.
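The second point can be illustrated with a small sketch. It assumes the text passes through a normalization step between decoding and encoding (here java.text.Normalizer with form NFD); Java's String constructor and getBytes do not normalize on their own, so a plain round trip of valid UTF-8 keeps the bytes. The class and method names (NormalizationBytes, roundTrip, roundTripWithNfd) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

class NormalizationBytes {

    // Plain decode and re-encode, with no processing in between.
    static byte[] roundTrip(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode, normalize to decomposed form (NFD), then re-encode.
    static byte[] roundTripWithNfd(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        return Normalizer.normalize(s, Normalizer.Form.NFD).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Precomposed "é" (U+00E9) is the two bytes C3 A9 in UTF-8.
        byte[] precomposed = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println("plain round trip equal: " + Arrays.equals(precomposed, roundTrip(precomposed)));
        // After NFD it becomes 'e' + U+0301, i.e. the three bytes 65 CC 81.
        System.out.println("bytes after NFD: " + roundTripWithNfd(precomposed).length);
    }
}
```

The text still renders as the same character either way, but the byte sequences differ, which is why a string is a poor container for data whose exact bytes matter.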

If you want to encode arbitrary binary data in a string in such a way that you are guaranteed to get the same bytes back when you decode it, your best bet is something like base64.
