How to split a string into chunks of a particular byte size?

I am interacting with an API that accepts strings that are a maximum of 5KB in size.

I want to take a string that may be more than 5KB and break it into chunks less than 5KB in size.

I then intend to pass each smaller-than-5KB string to the API endpoint, and perform further actions when all requests have finished, probably using something like:

await Promise.all([get_thing_from_api(string_1), get_thing_from_api(string_2), get_thing_from_api(string_3)])
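For illustration, here is a minimal sketch of how that call might look once the chunks exist as an array. The `get_thing_from_api` below is only a hypothetical stub so the snippet runs on its own (the real function would perform the request), and `sendAll` is a name made up for this example:

```javascript
// Hypothetical stand-in for the real API call; it only reports the UTF-8 byte size.
async function get_thing_from_api(text) {
  return { bytes: new TextEncoder().encode(text).length };
}

// Start one request per chunk and resolve once every request has succeeded.
async function sendAll(chunks) {
  return Promise.all(chunks.map((c) => get_thing_from_api(c)));
}

sendAll(["Hey there!", "€ 100 to", "pay"]).then((results) => console.log(results));
```

Promise.all rejects as soon as any single request fails, so Promise.allSettled may be the better fit if partial results are acceptable.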

I have read that a character in a UTF-8 string can occupy between 1 and 4 bytes.

For this reason, to calculate string length in bytes we can use:

// in Node, string is UTF-8    
Buffer.byteLength("here is some text"); 

// in Javascript  
new Blob(["here is some text"]).size

Source:
https://stackoverflow.com/a/56026151
https://stackoverflow.com/a/52254083
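As a small sketch of why the distinction matters: `.length` counts UTF-16 code units, while the encoded size depends on the character. The helper name `utf8Length` is an assumption made for this example; TextEncoder itself is available in browsers and in Node.

```javascript
// UTF-8 byte length of a string, measured by actually encoding it.
function utf8Length(str) {
  return new TextEncoder().encode(str).length;
}

for (const ch of ["a", "ф", "€", "😃"]) {
  // Prints the character, its .length (UTF-16 code units) and its UTF-8 byte length.
  console.log(ch, ch.length, utf8Length(ch));
}
// a 1 1
// ф 1 2
// € 1 3
// 😃 2 4
```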

My searches for "how to split strings into chunks of a certain size" return results that relate to splitting a string into strings of a particular character length, not byte length, e.g.:

var my_string = "1234 5 678905";
console.log(my_string.match(/.{1,2}/g)); // ["12", "34", " 5", " 6", "78", "90", "5"]

Source:
https://stackoverflow.com/a/7033662
https://stackoverflow.com/a/6259543
https://gist.github.com/hendriklammers/5231994

Question

Is there a way to split a string into strings of a particular byte length?

I could either:

  • assume that strings will only contain 1 byte per character
  • allow for the 'worst case scenario' that each character is 4 bytes

but would prefer a more accurate solution.

I would be interested to know of both Node and plain JavaScript solutions, if they exist.

EDIT

This approach to calculating byteLength might be helpful: it iterates over the characters in a string, gets each character code, and increments byteLength accordingly:

function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}

Source: https://stackoverflow.com/a/23329386

which led me to interesting experiments into the underlying data structures of Buffer:

var buf = Buffer.from('Hey! ф');
// <Buffer 48 65 79 21 20 d1 84>  
buf.length // 7
buf.toString().charCodeAt(0) // 72
buf.toString().charCodeAt(5) // 1092  
buf.toString().charCodeAt(6) // NaN    
buf[0] // 72
for (let i = 0; i < buf.length; i++) {
  console.log(buf[i]);
}
// 72 101 121 33 32 209 132 undefined
buf.slice(0,5).toString() // 'Hey! '
buf.slice(0,6).toString() // 'Hey! �'
buf.slice(0,7).toString() // 'Hey! ф'

but as @trincot pointed out in the comments, what is the correct way to handle multibyte characters? And how could I ensure chunks are split on spaces (so as not to 'break apart' a word)?

More info on Buffer: https://nodejs.org/api/buffer.html#buffer_buffer
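On the multibyte part of that question, one possible approach (a sketch, not taken from any answer here) relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a cut position can be moved back until it no longer lands on one. The name `chunkOnCharBoundary` is made up for this example, and it deliberately ignores word boundaries:

```javascript
// Split a string into UTF-8 chunks of at most maxBytes bytes, never cutting
// inside a multi-byte character. Words are NOT kept intact here.
function chunkOnCharBoundary(s, maxBytes) {
  // A single character can need 4 bytes, so smaller limits cannot always be met.
  if (maxBytes < 4) throw new RangeError("maxBytes must be at least 4");
  const bytes = new TextEncoder().encode(s);
  const decoder = new TextDecoder("utf-8");
  const result = [];
  let start = 0;
  while (start < bytes.length) {
    let end = Math.min(start + maxBytes, bytes.length);
    // Continuation bytes look like 10xxxxxx; step back until `end` points at
    // the first byte of a character (or at the end of the data).
    while (end < bytes.length && (bytes[end] & 0xc0) === 0x80) end--;
    result.push(decoder.decode(bytes.subarray(start, end)));
    start = end;
  }
  return result;
}

console.log(chunkOnCharBoundary("Hey! ф€😃", 6));
// -> [ 'Hey! ', 'ф€', '😃' ]
```

With a limit of 6, the first cut would fall in the middle of 'ф', so it moves back one byte, which is the situation that produced 'Hey! �' in the Buffer experiment.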

EDIT

In case it helps anyone else understand the brilliant logic in the accepted answer, the snippet below is a heavily commented version I made so I could understand it better.

/**
 * Takes a string and returns an array of substrings that are smaller than maxBytes.
 *
 * This is an overly commented version of the non-generator version of the accepted answer,
 * in case it helps anyone understand its (brilliant) logic.
 *
 * Both plain js and node variations are shown below - simply un/comment out your preference
 *
 * @param {string} s - the string to be chunked
 * @param {number} maxBytes - the maximum size of a chunk, in bytes
 * @return {Array} - an array of strings less than maxBytes (except in extreme edge cases)
 */
function chunk(s, maxBytes) {
  // for plain js
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder("utf-8").encode(s);
  // for node
  // let buf = Buffer.from(s);
  const result = [];
  var counter = 0;
  while (buf.length) {
    console.log("=============== BEG LOOP " + counter + " ===============");
    console.log("result is now:");
    console.log(result);
    console.log("buf is now:");
    // for plain js
    console.log(decoder.decode(buf));
    // for node
    // console.log(buf.toString());

    /* get index of the last space character in the first chunk,
       searching backwards from the maxBytes + 1 index */
    let i = buf.lastIndexOf(32, maxBytes + 1);
    console.log("i is: " + i);

    /* if no space is found in the first chunk, get index of the first space
       character after it, searching forwards from the maxBytes index -
       in edge cases where characters between spaces exceeds maxBytes,
       eg chunk("123456789x 1", 9), the chunk will exceed maxBytes */
    if (i < 0) i = buf.indexOf(32, maxBytes);
    console.log("at first condition, i is: " + i);

    /* if there's no space at all, take the whole string,
       again an edge case like chunk("123456789x", 9) will exceed maxBytes */
    if (i < 0) i = buf.length;
    console.log("at second condition, i is: " + i);

    // this is a safe cut-off point; never half-way a multi-byte
    // because the index is always the index of a space
    console.log("pushing buf.slice from 0 to " + i + " into result array");
    // for plain js
    result.push(decoder.decode(buf.slice(0, i)));
    // for node
    // result.push(buf.slice(0, i).toString());

    console.log("buf.slicing with value: " + (i + 1));
    // slice the string from the index + 1 forwards
    // it won't erroneously slice out a value after i, because i is a space
    buf = buf.slice(i + 1); // skip space (if any)
    console.log("=============== END LOOP " + counter + " ===============");
    counter++;
  }
  return result;
}

console.log(chunk("Hey there! € 100 to pay", 12));

Using Buffer does indeed seem to be the right direction. Given that:

  • the Buffer prototype has indexOf and lastIndexOf methods, and
  • 32 is the ASCII code of a space, and
  • 32 can never occur as part of a multi-byte character, since every byte of a multi-byte UTF-8 sequence has its most significant bit set,

... you can proceed as follows:

function chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    const result = [];
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take the whole string
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        result.push(buf.slice(0, i).toString());
        buf = buf.slice(i+1); // Skip space (if any)
    }
    return result;
}

console.log(chunk("Hey there! € 100 to pay", 12)); 
// -> [ 'Hey there!', '€ 100 to', 'pay' ]

You can consider extending this to also look for TAB, LF, or CR as split-characters. If so, and your input text can have CRLF sequences, you would need to detect those as well to avoid getting orphaned CR or LF characters in the chunks.
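A rough sketch of that extension follows. It is written with TextEncoder/TextDecoder rather than Buffer so it runs in both Node and browsers, and the names `SPLIT_BYTES` and `chunkOnWhitespace` are invented for this example. It treats any run of TAB, LF, CR, or space as one separator, so a CRLF pair is dropped as a whole, and it searches backwards from index maxBytes so that a chunk only exceeds the limit when there is no split character within it:

```javascript
const SPLIT_BYTES = new Set([9, 10, 13, 32]); // TAB, LF, CR, space

function chunkOnWhitespace(s, maxBytes) {
  const bytes = new TextEncoder().encode(s);
  const decoder = new TextDecoder("utf-8");
  const result = [];
  let start = 0;
  while (start < bytes.length) {
    // Skip leading split characters (e.g. the LF left over from a CRLF pair).
    while (start < bytes.length && SPLIT_BYTES.has(bytes[start])) start++;
    if (start >= bytes.length) break;
    let end = bytes.length;
    if (end - start > maxBytes) {
      // Search backwards from the limit for the last split character.
      end = start + maxBytes;
      while (end > start && !SPLIT_BYTES.has(bytes[end])) end--;
      if (end === start) {
        // None within the limit: search forwards, so this chunk exceeds maxBytes.
        end = start + maxBytes;
        while (end < bytes.length && !SPLIT_BYTES.has(bytes[end])) end++;
      }
    }
    // Trim trailing split characters (e.g. the CR of a CRLF pair).
    let stop = end;
    while (stop > start && SPLIT_BYTES.has(bytes[stop - 1])) stop--;
    result.push(decoder.decode(bytes.subarray(start, stop)));
    start = end;
  }
  return result;
}

console.log(chunkOnWhitespace("Hey there!\r\n€ 100\tto pay", 12));
// -> [ 'Hey there!', '€ 100\tto', 'pay' ]
```

Split characters inside a chunk (the TAB above) are kept as they are; only those at a chunk boundary are dropped.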

You can turn the above function into a generator, so that you control when you want to start the processing for getting the next chunk:

function * chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield buf.slice(0, i).toString();
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);

Browsers

Buffer is specific to Node. Browsers, however, implement TextEncoder and TextDecoder, which leads to similar code:

function * chunk(s, maxBytes) {
    const decoder = new TextDecoder("utf-8");
    let buf = new TextEncoder("utf-8").encode(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield decoder.decode(buf.slice(0, i));
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);

A possible solution is to count the bytes of each character:

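function charByteCounter(char){
    // Use the full code point: charCodeAt(0) would only see the first
    // UTF-16 code unit of characters such as emoji
    const cp = char.codePointAt(0)
    // UTF-8 encodes a code point in 1 to 4 bytes depending on its range
    if (cp <= 0x7f) return 1
    if (cp <= 0x7ff) return 2
    if (cp <= 0xffff) return 3
    return 4
}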

function * chunk(string, maxBytes) {
    let byteCounter = 0
    let buildString = ''
    for(const char of string){
        const bytes = charByteCounter(char)
        if(byteCounter + bytes > maxBytes){ // check if the current bytes + this char bytes is greater than maxBytes
            yield buildString // string with less or equal bytes number to maxBytes
            buildString = char
            byteCounter = bytes
            continue
        }
        buildString += char
        byteCounter += bytes
    }

    yield buildString
}

for (const s of chunk("Hey! 😃, nice to meet you!", 12))
    console.log(s);


Small addition to @trincot's answer:

If the string you are splitting contains a space (" "), the returned array always has at least two items, even when the full string would fit into maxBytes (and so should be returned as a single item).

To fix this I added a check in the first line of the while loop:

export function chunkText (text: string, maxBytes: number): string[] {
  let buf = Buffer.from(text)
  const result = []
  while (buf.length) {
    let i = buf.length >= maxBytes ? buf.lastIndexOf(32, maxBytes + 1) : buf.length
    // If no space found, try forward search
    if (i < 0) i = buf.indexOf(32, maxBytes)
    // If there's no space at all, take the whole string
    if (i < 0) i = buf.length
    // This is a safe cut-off point; never half-way a multi-byte
    result.push(buf.slice(0, i).toString())
    buf = buf.slice(i+1) // Skip space (if any)
  }
  return result
}
