Get the size of a String with an encoding, in bytes, without converting it to byte[]

Question

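I have a situation where I need to know the size of a String/encoding pair, in bytes, but cannot use the getBytes() method because 1) the String is very large and duplicating it in a byte[] array would use a large amount of memory, and more importantly 2) getBytes() allocates a byte[] array sized as the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X bytes (X is Java version specific).

So, is there a way to calculate the byte size of a String/encoding pair directly from the String object?

Update:

Here is a working implementation of jtahlborn's answer: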
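To make the arithmetic in the question concrete, here is a minimal sketch that estimates the worst-case buffer size as "char count * maximum bytes per char" and checks it against the largest possible array length. The class and method names (WorstCaseEstimate, worstCaseBytes, fitsInArray) are hypothetical, and whether getBytes() really sizes its buffer this way depends on the JDK version; the sketch only illustrates the calculation described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WorstCaseEstimate {
    // Upper bound for a buffer sized as "char count * max bytes per char".
    static long worstCaseBytes(long charCount, Charset cs) {
        double maxPerChar = cs.newEncoder().maxBytesPerChar();
        return (long) Math.ceil(charCount * maxPerChar);
    }

    // Java arrays are indexed by int, so the length can never exceed
    // Integer.MAX_VALUE (the real VM limit is slightly lower).
    static boolean fitsInArray(long charCount, Charset cs) {
        return worstCaseBytes(charCount, cs) <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long chars = 1_500_000_000L;
        Charset cs = StandardCharsets.UTF_16;
        System.out.println("worst case bytes: " + worstCaseBytes(chars, cs));
        System.out.println("fits in one array: " + fitsInArray(chars, cs));
    }
}
```

Using long for the estimate matters here: the same multiplication done in int would silently overflow.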

private class CountingOutputStream extends OutputStream {
    // long rather than int: the total may exceed 2^31 - 1 for very large strings
    long total;

    @Override
    public void write(int i) {
        throw new RuntimeException("don't use");
    }
    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override public void write(byte[] b, int offset, int len) {
        total += len;
    }
}

Answer 1

Simple, just write it to a dummy output stream:

class CountingOutputStream extends OutputStream {
  // long rather than int, so totals above 2^31 - 1 bytes do not overflow
  private long _total;

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public long getTotalSize(){
     return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

Not only is it simple, it is probably just as fast as the other "complex" answers.
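The pieces above can be combined into one self-contained method. This is a minimal sketch under a few assumptions: the names ByteCounter, CountingStream and encodedLength are hypothetical, the chunk size of 8192 chars is arbitrary, and the counter is a long so that results above 2^31 - 1 bytes are representable.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteCounter {
    // Discards every byte and only remembers how many were written.
    static final class CountingStream extends OutputStream {
        long total;

        @Override public void write(int b) { total++; }

        // write(byte[]) in OutputStream delegates here, so one override is enough.
        @Override public void write(byte[] b, int off, int len) { total += len; }
    }

    static long encodedLength(String s, Charset cs) {
        CountingStream out = new CountingStream();
        final int chunk = 8192; // arbitrary chunk size (assumption)
        try (Writer w = new OutputStreamWriter(out, cs)) {
            // Write in slices so the writer never copies the whole string at once.
            for (int i = 0; i < s.length(); i += chunk) {
                w.write(s, i, Math.min(chunk, s.length() - i));
            }
        } catch (IOException e) {
            // The counting stream itself never throws; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return out.total;
    }

    public static void main(String[] args) {
        System.out.println("Total bytes: " + encodedLength("héllo", StandardCharsets.UTF_8));
    }
}
```

Closing the writer in the try-with-resources block flushes any bytes still buffered in the encoder before the total is read.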

Answer 2

The same, using the Apache Commons libraries:

public static long stringLength(String string, Charset charset) {

    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        IOUtils.write(string, count, charset.name());
        count.flush();
        return count.getCount();
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}

Answer 3

Here is an apparently working implementation:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestUnicode {

    private final static int ENCODE_CHUNK = 100;

    public static long bytesRequiredToEncode(final String s,
            final Charset encoding) {
        long count = 0;
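        for (int i = 0; i < s.length(); ) {
            int end = i + ENCODE_CHUNK;
            if (end >= s.length()) {
                end = s.length();
            } else if (Character.isLowSurrogate(s.charAt(end))) {
                // Extend the chunk so a surrogate pair is never split across chunks.
                end++;
            }
            // Encode one chunk at a time and add up the encoded sizes.
            count += encoding.encode(s.substring(i, end)).remaining();
            i = end;
        }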
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.appendCodePoint(11614);
            sb.appendCodePoint(1061122);
            sb.appendCodePoint(2065);
            sb.appendCodePoint(1064124);
        }
        Charset cs = StandardCharsets.UTF_8;

        System.out.println(bytesRequiredToEncode(new String(sb), cs));
        System.out.println(new String(sb).getBytes(cs).length);
    }
}

The output is:

1400
1400

In practice, I would increase ENCODE_CHUNK to around 10 MChars.

Probably slightly less efficient than brettw's answer, but simpler to implement.
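The surrogate check at the chunk boundary matters because a surrogate pair encoded as two separate halves does not produce the same bytes as the pair encoded together. A minimal sketch of that effect, with a hypothetical class name (SurrogateSplitDemo) and U+1F600 chosen only as an example of a character outside the BMP:

```java
import java.nio.charset.StandardCharsets;

final class SurrogateSplitDemo {
    // Number of bytes Charset.encode produces for the given string in UTF-8.
    static int encodedSize(String s) {
        return StandardCharsets.UTF_8.encode(s).remaining();
    }

    public static void main(String[] args) {
        // One code point above U+FFFF, stored as a high + low surrogate pair.
        String pair = new String(Character.toChars(0x1F600));

        int whole = encodedSize(pair);
        // Each half alone is malformed input, which Charset.encode replaces
        // with the charset's replacement bytes instead of the real encoding.
        int split = encodedSize(pair.substring(0, 1)) + encodedSize(pair.substring(1));

        System.out.println("encoded together: " + whole);
        System.out.println("encoded as two halves: " + split);
    }
}
```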

Answer 4

According to this post, Guava has an implementation:

Utf8.encodedLength()
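For UTF-8 specifically, the length can also be computed by scanning the chars, with no allocation at all. The following is a minimal JDK-only sketch of that idea, not Guava's implementation; the names Utf8Length and utf8Length are hypothetical. As an assumption made to mirror what String.getBytes(UTF_8) does, an unpaired surrogate is counted as one byte (the '?' replacement).

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {
    static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // two-byte sequence
            } else if (Character.isSurrogate(c)) {
                boolean paired = Character.isHighSurrogate(c)
                        && i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (paired) {
                    bytes += 4;                   // supplementary code point
                    i++;                          // skip the low surrogate
                } else {
                    bytes += 1;                   // assumption: replaced by '?'
                }
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \u20ac";
        System.out.println(utf8Length(s) + " vs " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```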

Answer 5

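OK, this is extremely gross. I admit that, but this stuff is hidden by the JVM, so we have to dig a little. And sweat a little.

First, we need the actual char[] that backs the String, without making a copy. To do this we have to use reflection to get at the "value" field: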

char[] chars = null;
for (Field field : String.class.getDeclaredFields()) {
    if ("value".equals(field.getName())) {
        field.setAccessible(true);
        chars = (char[]) field.get(string); // <--- got it!
        break;
    }
}

Next you need to implement a subclass of java.nio.ByteBuffer. Something like:

class MyByteBuffer extends ByteBuffer {
    int length;            
    // Your implementation here
};

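Ignore all of the getters and implement all of the put methods, such as put(byte) and putChar(char). Inside something like put(byte), increment length by 1; inside put(byte[]), increment length by the array length. Get it? For everything that is put, you add its size to length. But you are not storing anything in your ByteBuffer, you are just counting and throwing away, so no space is taken. If you set breakpoints on the put methods, you can probably figure out which ones you actually need to implement. putFloat(float), for example, is probably not used.

Now for the grand finale, putting it all together: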

MyByteBuffer bbuf = new MyByteBuffer();         // your "counting" buffer
CharBuffer cbuf = CharBuffer.wrap(chars);       // wrap your char array
Charset charset = Charset.forName("UTF-8");     // your charset goes here
CharsetEncoder encoder = charset.newEncoder();  // make a new encoder
encoder.encode(cbuf, bbuf, true);               // do it!
System.out.printf("Length: %d\n", bbuf.length); // pay me US$1,000,000
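As far as I know, java.nio.ByteBuffer cannot be subclassed from application code because its constructors are package-private, so the counting buffer above may not compile as written. A hedged alternative that keeps the same idea (drive a CharsetEncoder and only count its output) is to reuse one small scratch buffer and clear it after every pass. In this sketch the names ScratchBufferCount and encodedLength and the 8192-byte buffer size are assumptions; CharBuffer.wrap(CharSequence) is used in place of the reflection step.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ScratchBufferCount {
    static long encodedLength(CharSequence s, Charset cs) {
        // REPLACE mirrors what String.getBytes does with bad input.
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);          // read-only view of the chars
        ByteBuffer out = ByteBuffer.allocate(8192);  // small reusable scratch space
        long total = 0;
        CoderResult r;
        do {
            // Encode until the scratch buffer is full or the input is consumed.
            r = enc.encode(in, out, true);
            total += out.position();
            out.clear();                             // discard the bytes, keep the count
        } while (r.isOverflow());
        do {
            // Some encoders hold back trailing bytes until flush.
            r = enc.flush(out);
            total += out.position();
            out.clear();
        } while (r.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("Length: %d%n", encodedLength("héllo", StandardCharsets.UTF_8));
    }
}
```

Because the encoder sees the input as one continuous CharBuffer, surrogate pairs are never split, and memory use stays at the size of the scratch buffer regardless of the string length.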

5 answers

Answer 1: score 12, accepted, 2013-11-08 19:43:12
Answer 2: score 2, 2017-10-30 14:56:30
Answer 3: score 1, 2013-11-08 19:23:57
Answer 4: score 1, 2020-04-13 21:24:12
Answer 5: score -2, 2013-11-08 08:27:15