How to cut a String into 1-megabyte substrings in Java?

Question

I have come up with the following:

public static void cutString(String s) {
    List<String> strings = new ArrayList<>();
    int index = 0;
    while (index < s.length()) {
        strings.add(s.substring(index, Math.min(index + 1048576, s.length())));
        index += 1048576;
    }
}

But my problem is that in UTF-8 some characters take more than 1 byte, so using 1048576 as the position at which to cut the String does not work. I was thinking about maybe using an Iterator, but that doesn't seem efficient. What would be the most efficient solution for this? A chunk can be smaller than 1 MB to avoid slicing a character, just not bigger than that!

Answer 1

Quick, unsafe hack

You can use s.getBytes("UTF-8") to get an array with the actual bytes of the string's UTF-8 encoding. Like this:

System.out.println("¡Adiós!".getBytes("UTF-8").length);
// Prints: 9

Once you have that, it's just a matter of splitting the byte array into chunks of length 1048576 and then turning the chunks back into strings with new String(chunk, "UTF-8").

However, by doing it like that you can break multi-byte characters at the beginning or end of the chunks. Say a 3-byte character starts at the 1048576th byte: its first byte would go into the first chunk and the other two bytes would be put into the second chunk, thus breaking the encoding.
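To make the breakage concrete, here is a minimal sketch that cuts a short string in the middle of a 3-byte character. The class and method names (NaiveByteSplitDemo, decodePrefix, decodeSuffix) are made up for this illustration, and the cut position 2 stands in for the 1048576 boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NaiveByteSplitDemo {
    // Decodes the first `cut` bytes of the UTF-8 encoding of `s`.
    static String decodePrefix(String s, int cut) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOfRange(bytes, 0, cut), StandardCharsets.UTF_8);
    }

    // Decodes everything from byte offset `cut` to the end.
    static String decodeSuffix(String s, int cut) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOfRange(bytes, cut, bytes.length), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "a\u20ACb"; // U+20AC (euro sign) takes 3 bytes in UTF-8
        int cut = 2;           // falls right after the first byte of the euro sign
        String left = decodePrefix(s, cut);
        String right = decodeSuffix(s, cut);
        // The decoder substitutes U+FFFD for the broken byte sequences,
        // so gluing the two halves back together no longer gives the input.
        System.out.println((left + right).equals(s)); // prints: false
    }
}
```

The two halves decode without throwing, which is what makes this hack dangerous: the damage shows up only as replacement characters in the output.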

Proper approach

If you can relax the "1 MB" requirement, you can take a safer approach: split the string into chunks of 1048576 characters (not bytes), and then test each chunk's real length with getBytes, removing chars from the end as needed until the real size is less than or equal to 1 MB.

Here's an implementation that won't break characters, at the expense of having some chunks smaller than the given size:

public static List<String> cutString(String original, int chunkSize, String encoding) throws UnsupportedEncodingException {
    List<String> strings = new ArrayList<>();
    final int end = original.length();
    int from = 0, to = 0;
    do {
        to = (to + chunkSize > end) ? end : to + chunkSize; // next chunk, watch out for small strings
        String chunk = original.substring(from, to); // get chunk
        while (chunk.getBytes(encoding).length > chunkSize) { // adjust chunk to proper byte size if necessary
            chunk = original.substring(from, --to);
        }
        strings.add(chunk); // add chunk to collection
        from = to; // next chunk
    } while (to < end);
    return strings;
}

I tested it with chunkSize = 24 so you can see the effect. It should work just as well with any other size:

    String test = "En la fase de maquetación de un documento o una página web o para probar un tipo de letra es necesario visualizar el aspecto del diseño. ٩(-̮̮̃-̃)۶ ٩(●̮̮̃•̃)۶ ٩(͡๏̯͡๏)۶ ٩(-̮̮̃•̃).";

    for (String chunk : cutString(test, 24, "UTF-8")) {
        System.out.println(String.format(
                "Chunk [%s] - Chars: %d - Bytes: %d",
                chunk, chunk.length(), chunk.getBytes("UTF-8").length));
    }
    /*
    Prints:
        Chunk [En la fase de maquetaci] - Chars: 23 - Bytes: 23
        Chunk [ón de un documento o un] - Chars: 23 - Bytes: 24
        Chunk [a página web o para pro] - Chars: 23 - Bytes: 24
        Chunk [bar un tipo de letra es ] - Chars: 24 - Bytes: 24
        Chunk [necesario visualizar el ] - Chars: 24 - Bytes: 24
        Chunk [aspecto del diseño. ٩(] - Chars: 22 - Bytes: 24
        Chunk [-̮̮̃-̃)۶ ٩(●̮̮] - Chars: 14 - Bytes: 24
        Chunk [̃•̃)۶ ٩(͡๏̯͡] - Chars: 12 - Bytes: 23
        Chunk [๏)۶ ٩(-̮̮̃•̃).] - Chars: 14 - Bytes: 24
     */

Another test with a 3 MB string like the one you mention in your comments:

    String string = "0123456789ABCDEF";
    StringBuilder bigAssString = new StringBuilder(1024*1024*3);
    for (int i = 0; i < ((1024*1024*3)/16); i++) {
        bigAssString.append(string);
    }
    System.out.println("bigAssString.length = " + bigAssString.toString().length());
    bigAssString.replace((1024*1024*3)/4, ((1024*1024*3)/4)+1, "á");

    for (String chunk : cutString(bigAssString.toString(), 1024*1024, "UTF-8")) {
        System.out.println(String.format(
                "Chunk [...] - Chars: %d - Bytes: %d",
                chunk.length(), chunk.getBytes("UTF-8").length));
    }
    /*
    Prints:
        bigAssString.length = 3145728
        Chunk [...] - Chars: 1048575 - Bytes: 1048576
        Chunk [...] - Chars: 1048576 - Bytes: 1048576
        Chunk [...] - Chars: 1048576 - Bytes: 1048576
        Chunk [...] - Chars: 1 - Bytes: 1
     */
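Two caveats about cutString as written: it re-encodes the whole chunk every time it drops a char, which gets slow for text made mostly of multi-byte characters, and substring can cut between the two halves of a surrogate pair. The following is a sketch of one possible variant, not the answer's own method, that walks the string once by code point and counts UTF-8 bytes arithmetically. The names Utf8Chunker and utf8Length are invented for this example, and it assumes the target encoding is always UTF-8.

```java
import java.util.ArrayList;
import java.util.List;

class Utf8Chunker {
    // Number of bytes UTF-8 uses for one Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Splits `original` into chunks whose UTF-8 encoding is at most maxBytes long.
    // Unlike the version above, an empty input yields an empty list.
    static List<String> cutString(String original, int maxBytes) {
        if (maxBytes < 4) {
            // Assumption of this sketch: a chunk must fit the longest UTF-8 sequence.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;      // char index where the current chunk begins
        int chunkBytes = 0; // UTF-8 bytes counted so far in the current chunk
        int i = 0;
        while (i < original.length()) {
            int cp = original.codePointAt(i); // reads a surrogate pair as one code point
            int len = utf8Length(cp);
            if (chunkBytes + len > maxBytes) {
                // The next character does not fit: close the chunk before it.
                chunks.add(original.substring(start, i));
                start = i;
                chunkBytes = 0;
            }
            chunkBytes += len;
            i += Character.charCount(cp);
        }
        if (start < original.length()) {
            chunks.add(original.substring(start));
        }
        return chunks;
    }
}
```

An unpaired surrogate is counted as 3 bytes here, while getBytes would encode it as a single '?', so for malformed input the count errs on the high side and chunks still stay within the limit.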

Answer 2

You can use a ByteArrayOutputStream with an OutputStreamWriter:

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Writer w = new OutputStreamWriter(out, "utf-8");
    // write everything to the writer
    w.write(myString);
    w.flush(); // push any buffered characters through to the byte array
    byte[] bytes = out.toByteArray();
    // now you have the actual size of the string in bytes and can parcel it out by MB.
    // Be aware that a multi-byte character may still end up split across two chunks.
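The split-character problem mentioned in that last comment can be avoided by moving each cut back to the start of a character. In UTF-8, every continuation byte has the bit pattern 10xxxxxx, so a cut is safe whenever the byte right after it is not a continuation byte. Below is a minimal sketch of that idea; Utf8ByteSplitter and its methods are hypothetical names, and it uses getBytes rather than the writer for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8ByteSplitter {
    // A UTF-8 continuation byte looks like 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Splits the UTF-8 bytes of `s` into pieces of at most maxBytes bytes,
    // cutting only at character boundaries.
    static List<String> split(String s, int maxBytes) {
        if (maxBytes < 4) {
            // Assumption of this sketch: a piece must fit the longest UTF-8 sequence,
            // otherwise backing up could produce an empty piece and never advance.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (from < bytes.length) {
            int to = Math.min(from + maxBytes, bytes.length);
            // If the cut would land inside a character, back up to its lead byte.
            while (to < bytes.length && isContinuation(bytes[to])) {
                to--;
            }
            chunks.add(new String(bytes, from, to - from, StandardCharsets.UTF_8));
            from = to;
        }
        return chunks;
    }
}
```

The trade-off is memory: the whole byte array is held alongside the original string, which may matter for very large inputs.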

Answer 1: 4, accepted, 2017-04-19 15:39:45
Answer 2: 1, 2017-04-19 15:35:47
