Splitting a string containing multi-byte characters into an array of strings

Question

I have a piece of code that is intended to split a string into an array of strings, using CHUNK_SIZE as the chunk size in bytes (I'm doing this to paginate results). This works in most cases when characters are 1 byte, but when a multi-byte character (for example a 2-byte French character like é, or a 3- or 4-byte Chinese character) falls exactly on the split location, I end up with unreadable characters at the end of the first array element and at the start of the second one.

Is there a way to fix the code to account for multibyte characters so they are maintained in the final result?

public static ArrayList<String> splitFile(String data) throws Exception {
    ArrayList<String> messages = new ArrayList<>();
    int CHUNK_SIZE = 400000; // 400,000 bytes, roughly 0.4 MB

    if (data.getBytes().length > CHUNK_SIZE) {
        byte[] buffer = new byte[CHUNK_SIZE];
        int start = 0, end = buffer.length;
        long remaining = data.getBytes().length;
        ByteArrayInputStream inputStream =
                new ByteArrayInputStream(data.getBytes());

        while ((inputStream.read(buffer, start, end)) != -1) {
            ByteArrayOutputStream outputStream =
                    new ByteArrayOutputStream();
            outputStream.write(buffer, start, end);
            messages.add(outputStream.toString("UTF-8"));
            remaining = remaining - end;

            if (remaining <= end) {
                end = (int) remaining;
            }
        }
        return messages;
    }

    messages.add(data);
    return messages;
}
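
To make the symptom concrete, here is a minimal sketch that cuts the two UTF-8 bytes of é apart and decodes each half on its own. The names `BrokenSplitDemo` and `decodeSlice` are made up for illustration and are not part of the code above. Each half decodes to the replacement character U+FFFD, which is the unreadable character described in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BrokenSplitDemo {
    // Hypothetical helper: decodes bytes[from, to) as UTF-8, ignoring its neighbours.
    static String decodeSlice(byte[] bytes, int from, int to) {
        return new String(Arrays.copyOfRange(bytes, from, to), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encodes as the two bytes 0xC3 0xA9.
        byte[] bytes = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);                              // 2
        System.out.println(decodeSlice(bytes, 0, 1).equals("\uFFFD")); // true
        System.out.println(decodeSlice(bytes, 1, 2).equals("\uFFFD")); // true
        System.out.println(decodeSlice(bytes, 0, 2).equals("\u00E9")); // true
    }
}
```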

Answer 1

You want to:

count characters not bytes
use regex for the chunk size and word boundary sensitivity
write less code

ergo,

private static final int CHUNK_SIZE = 400000;

// Arrays.asList returns a List, not an ArrayList, so the return type is List<String>.
public static List<String> splitFile(String data) {
    return Arrays.asList(data.split("(?s)(?<=\\G.{1," + CHUNK_SIZE + "}\\b) +"));
}

Breaking down the regex:

(?s) means “dot should match new lines”
\G means “the end of the last match”, and is initialized to start of input
\b means “word boundary”
(?<=\G.{1,400000}\b) means “preceded by the end of the last match then up to 400000 characters then a word boundary”

Not sure if you really need a List returned or not. You could just return the string array from the split.
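
If the goal is only the first point (count characters, not bytes), a loop over code points is an alternative to the regex. This is a minimal sketch under that assumption; `CodePointChunks` and `split` are hypothetical names. Unlike the regex it ignores word boundaries, and it bounds the number of characters per chunk, not the number of bytes.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointChunks {
    // Hypothetical helper: pieces of at most maxCodePoints Unicode code points each.
    static List<String> split(String text, int maxCodePoints) {
        if (maxCodePoints <= 0) {
            throw new IllegalArgumentException("maxCodePoints must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = start;
            int count = 0;
            // Advance one whole code point at a time (1 or 2 chars in UTF-16).
            while (end < text.length() && count < maxCodePoints) {
                end += Character.charCount(text.codePointAt(end));
                count++;
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }
}
```

For example, `CodePointChunks.split("a\uD83D\uDE00b", 2)` keeps the surrogate pair together in the first chunk and puts `b` in the second.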

Answer 2

public static List<String> splitFile(String data) throws IOException {
    List<String> messages = new ArrayList<>();
    final int CHUNK_SIZE = 400_000; // 400,000 bytes, roughly 0.4 MB

    byte[] dataBytes = data.getBytes(StandardCharsets.UTF_8);
    byte[] buffer = new byte[CHUNK_SIZE];
    int start = 0;
    final int end = CHUNK_SIZE;
    ByteArrayInputStream inputStream = new ByteArrayInputStream(dataBytes);

    for (; ; ) {
        int read = inputStream.read(buffer, start, end - start);
        if (read == -1) {
            if (start != 0) {
                messages.add(new String(buffer, 0, start,
                        StandardCharsets.UTF_8));
            }
            break;
        }
        // Check for half read multi-byte sequences:
        int fullEnd = start + read;
        while (fullEnd > 0) {
            byte b = buffer[fullEnd - 1];
            if (b >= 0) { // ASCII.
                break;
            }
            if ((b & 0xC0) == 0xC0) { // Start byte of sequence.
                --fullEnd;
                break;
            }
            --fullEnd;
        }
        messages.add(new String(buffer, 0, fullEnd, StandardCharsets.UTF_8));
        start += read - fullEnd;
        if (start > 0) { // Copy the bytes after fullEnd to the start.
            System.arraycopy(buffer, fullEnd, buffer, 0, start);
            //               src     srcI     dest    destI len
        }
    }
    return messages;
}

I have kept the ByteArrayInputStream, as most often one reads from an InputStream instead of having all bytes in memory.

Then the chunk buffer is filled from start rather than from 0, as some bytes from the prior chunk read might linger there.

Reading gives the number of bytes read or -1.

At the end, an ASCII char is okay; otherwise I position the end at the beginning of the last multi-byte sequence. Maybe that sequence was completely read, maybe not. Here I just keep it for the next chunk being read.
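
When all bytes are already in memory, the same idea can be sketched without a stream. This is a variant, not the code above: it looks at the byte just past the cut, and if that byte is a continuation byte (bit pattern 10xxxxxx) the cut is inside a character and is moved back. `Utf8Chunks` and `split` are hypothetical names, and the sketch assumes the bytes come from `String.getBytes(StandardCharsets.UTF_8)`, so they are valid UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8Chunks {
    // Hypothetical helper: strings whose UTF-8 encoding is at most maxBytes long,
    // never cut inside a multi-byte character.
    static List<String> split(String data, int maxBytes) {
        if (maxBytes < 4) {
            // One UTF-8 character can take up to 4 bytes; smaller chunks could stall.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < bytes.length) {
            int end = Math.min(start + maxBytes, bytes.length);
            // If the byte just past the cut is a continuation byte (10xxxxxx),
            // the cut falls inside a character: step back to its lead byte.
            while (end < bytes.length && (bytes[end] & 0xC0) == 0x80) {
                end--;
            }
            chunks.add(new String(bytes, start, end - start, StandardCharsets.UTF_8));
            start = end;
        }
        return chunks;
    }
}
```

An empty input gives an empty list here, whereas the code in the question returns a list containing the empty string.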

This code has not been compiled or tested.

A List of messages is not memory friendly either.

BTW, on char[] one would have a similar problem: sometimes a Unicode code point (symbol) takes two UTF-16 chars.
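
The char[] case can be illustrated with a small sketch. The names `SurrogateDemo` and `safeCharSplit` are hypothetical. The helper moves a split index back by one char when it would fall between the two halves of a surrogate pair.

```java
final class SurrogateDemo {
    // Hypothetical helper: returns index, or index - 1 if index would
    // separate a high surrogate from the low surrogate that follows it.
    static int safeCharSplit(String s, int index) {
        if (index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index))) {
            return index - 1;
        }
        return index;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', one emoji (U+1F600), 'b'
        System.out.println(s.length());                        // 4 chars
        System.out.println(s.codePointCount(0, s.length()));   // 3 code points
        System.out.println(safeCharSplit(s, 2));               // 1
    }
}
```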

Answer 3

Since you are doing this for paginating results, it may be useful to split the text not by characters but by words. You can iterate over the character indices of the string and check, for each word, whether at least half of it fits on the page; if not, start a new page.

Example with a limited line size on one page. It works the same way with a limited page size in a multi-page document:

String text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, " +
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
        "nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in " +
        "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla " +
        "pariatur. Excepteur sint occaecat cupidatat non proident, sunt in " +
        "culpa qui officia deserunt mollit anim id est laborum.";

int length = 55;

ArrayList<String> lines = new ArrayList<>();

int lastWord = 0;
int lastLine = 0;
for (int i = 0; i < text.length(); i++) {
    if (text.charAt(i) == ' ') {
        if (i - lastLine + (i - lastWord) / 2 > length) {
            lines.add(text.substring(lastLine, i));
            lastLine = i + 1;
        }
        lastWord = i + 1;
    }
}
lines.add(text.substring(lastLine));

// output line by line
lines.forEach(System.out::println);

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui
officia deserunt mollit anim id est laborum.

See also: How to split a string after a certain length? But it should be divided after word completion

Answer 1: score 2, 2020-12-29 02:36:37
Answer 2: score 1, accepted, 2020-12-29 03:13:03
Answer 3: score 1
