Converting a code point to a UTF-8 byte array in Java using shift operations

Question

I have used the accepted answer here to "manually" convert from Unicode to UTF-8 code units. The problem is that I need the resulting UTF-8 to be contained in a byte array. How can I do that, using shift operations wherever possible to go from hexadecimal to UTF-8?

The code I already have is the following:

 public static void main(String[] args)
   throws UnsupportedEncodingException, CharacterCodingException {

   String st = "ñ";

   for (int i = 0; i < st.length(); i++) {
      int unicode = st.charAt(i);
      codepointToUTF8(unicode);
   }
 }

 public static byte[] codepointToUTF8(int codepoint) {
    byte[] hb = codepointToHexa(codepoint);
    byte[] binaryUtf8 = null;

    if (codepoint <= 0x7F) {
      binaryUtf8 = parseRange(hb, 8);
    } else if (codepoint <= 0x7FF) {
      binaryUtf8 = parseRange(hb, 16);
    } else if (codepoint <= 0xFFFF) {
      binaryUtf8 = parseRange(hb, 24);
    } else if (codepoint <= 0x1FFFFF) {
      binaryUtf8 = parseRange(hb, 32);
    }

    byte[] utf8Codeunits = new byte[hexStr.length()];
    for (int i = 0; i < hexStr.length(); i++) {
      utf8Codeunits[i] = (byte) hexStr.charAt(i);
      System.out.println(utf8Codeunits[i]); // prints 99 51 98 49,
      // which is the same as c3b1, the UTF-8 for ñ
    }

    return binaryUtf8;
  }


  public static byte[] codepointToHexa(int codepoint) {
    int n = codepoint;
    int m;

    List<Byte> list = new ArrayList<>();
    while (n >= 16) {
      m = n % 16;
      n = n / 16;
      list.add((byte) m);
    }
    list.add((byte) n);
    byte[] bytes = new byte[list.size()];
    for (int i = list.size() - 1; i >= 0; i--) {
      bytes[list.size() - i - 1] = list.get(i);
    }

    return bytes;
  }

  private static byte[] parseRange(byte[] hb, int length) {

    byte[] binarybyte = new byte[length];
    boolean[] filled = new boolean[length];

    int index = 0;
    if (length == 8) {
      binarybyte[0] = 0;
      filled[0] = true;
    } else {
      int cont = 0;
      while (cont < length / 8) {
        filled[index] = true;
        binarybyte[index++] = 1;
        cont++;
      }
      binarybyte[index] = 0;
      filled[index] = true;
      index = 8;
      while (index < length) {
        filled[index] = true;
        binarybyte[index++] = 1;
        binarybyte[index] = 0;
        filled[index] = true;
        index += 7;
      }
    }

    byte[] hbbinary = convertHexaArrayToBinaryArray(hb);
    int hbindex = hbbinary.length - 1;

    for (int i = length - 1; i >= 0; i--) {
      if (!filled[i] && hbindex >= 0) {
        // we fill it and advance the iterator
        binarybyte[i] = hbbinary[hbindex];
        hbindex--;
        filled[i] = true;
      } else if (!filled[i]) {
        binarybyte[i] = 0;
        filled[i] = true;
      }
    }
    return binarybyte;
  }

 private static byte[] convertHexaArrayToBinaryArray(byte[] hb) {

    byte[] binaryArray = new byte[hb.length * 4];
    String aux = "";
    for (int i = 0; i < hb.length; i++) {

      aux = Integer.toBinaryString(hb[i]);
      int length = aux.length();
      // toBinaryString doesn't return a 4 bit string, so we fill it with 0s
      // if length is not a multiple of 4
      while (length % 4 != 0) {
        length++;
        aux = "0" + aux;
      }

      for (int j = 0; j < aux.length(); j++) {
        binaryArray[i * 4 + j] = (byte) (aux.charAt(j) - '0');
      }
    }

    return binaryArray;
  }

I don't know how to handle bytes properly, so I'm aware that what I did is probably wrong.

Answer 1

UTF-8 packs the bits of a Unicode code point into bytes as follows:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
... (the original scheme extends to 6 bytes; Unicode code points, up to U+10FFFF, need at most 4)

The rightmost bit is the least significant bit of the code point.
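
As a worked instance of the second row, here is a minimal sketch that encodes a single code point in the two-byte range with one shift and one mask. The class and method names (`TwoByteExample`, `encodeTwoByte`) are made up for illustration and are not part of the question or of any library.

```java
// Hypothetical helper, for illustration only: encodes a code point in the
// two-byte range U+0080..U+07FF using the 110xxxxx 10xxxxxx layout.
final class TwoByteExample {
    static byte[] encodeTwoByte(int cp) {
        if (cp < 0x80 || cp > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte range: " + cp);
        }
        byte lead = (byte) (0xC0 | (cp >> 6));   // 110xxxxx: the upper 5 bits
        byte cont = (byte) (0x80 | (cp & 0x3F)); // 10xxxxxx: the lower 6 bits
        return new byte[] { lead, cont };
    }

    public static void main(String[] args) {
        byte[] out = encodeTwoByte(0xF1); // U+00F1, the 'ñ' from the question
        // Mask with 0xFF because Java bytes are signed.
        System.out.printf("%02x %02x%n", out[0] & 0xFF, out[1] & 0xFF); // prints: c3 b1
    }
}
```

For U+00F1 the bits are 00011 110001: the upper five go into the lead byte (0xC3) and the lower six into the continuation byte (0xB1).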

import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

static byte[] utf8(IntStream codePoints) {
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    final byte[] cpBytes = new byte[6]; // scratch buffer; 6 bytes cover any non-negative int
    codePoints.forEach((cp) -> {
        if (cp < 0) {
            throw new IllegalStateException("No negative code point allowed");
        } else if (cp < 0x80) {
            baos.write(cp);
        } else {
            int bi = 0;
            int lastPrefix = 0xC0;
            int lastMask = 0x1F;
            for (;;) {
                int b = 0x80 | (cp & 0x3F);
                cpBytes[bi] = (byte)b;
                ++bi;
                cp >>= 6;
                if ((cp & ~lastMask) == 0) {
                    cpBytes[bi] = (byte) (lastPrefix | cp);
                    ++bi;
                    break;
                }
                lastPrefix = 0x80 | (lastPrefix >> 1);
                lastMask >>= 1;
            }
            while (bi > 0) {
                --bi;
                baos.write(cpBytes[bi]);
            }
        }
    });
    return baos.toByteArray();
}

Except for 7-bit ASCII, the encoding can be done in a loop.
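
For comparison, the same bit layout can be unrolled into one branch per encoded length, which returns a `byte[]` for a single code point directly. This is only a sketch under a few assumptions: the names `Utf8Shift` and `encode` are hypothetical, the input is restricted to the modern Unicode range (U+0000 to U+10FFFF, so at most 4 bytes), and surrogate values are rejected. The `main` method compares the result with the JDK's own encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class and method names, for illustration only.
final class Utf8Shift {
    /** Encodes one code point (U+0000..U+10FFFF, surrogates excluded) as UTF-8. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // One sample per encoded length: 'A', 'ñ', '€', and U+1F600.
        int[] samples = { 0x41, 0xF1, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            byte[] expected =
                new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches JDK: "
                + Arrays.equals(expected, encode(cp)));
        }
    }
}
```

When encoding a whole string, note that `charAt` returns UTF-16 code units, not code points; `String.codePoints()` yields an `IntStream` of code points, so characters outside the Basic Multilingual Plane arrive as a single value.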
