How to split a string into chunks of a particular byte size?
I am interacting with an API that accepts strings of at most 5KB.
I want to take a string that may be larger than 5KB and break it into chunks of less than 5KB each.
I then intend to pass each smaller-than-5KB string to the API endpoint, and perform further actions when all requests have finished, probably using something like:
await Promise.all([get_thing_from_api(string_1), get_thing_from_api(string_2), get_thing_from_api(string_3)])
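A minimal sketch of that fan-out, assuming the chunks are already in an array. Here get_thing_from_api is a hypothetical stub that only reports the UTF-8 byte size it received (the real function would perform the request), and sendAll is a name made up for this example:

```javascript
// Hypothetical stand-in for the real API call: resolves with the
// UTF-8 byte size of the string it was given.
async function get_thing_from_api(text) {
  return new TextEncoder().encode(text).length;
}

// Start one request per chunk and wait until all of them have finished.
async function sendAll(chunks) {
  return Promise.all(chunks.map((piece) => get_thing_from_api(piece)));
}

sendAll(["Hey there!", "€ 100 to", "pay"]).then((sizes) => console.log(sizes));
```

Promise.all rejects as soon as one request fails, so Promise.allSettled may be a better fit if partial results are still useful.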
I have read that a character in a string can take between 1 and 4 bytes.
For this reason, to calculate string length in bytes we can use:
// in Node, string is UTF-8
Buffer.byteLength("here is some text");
// in Javascript
new Blob(["here is some text"]).size
Sources:
https://stackoverflow.com/a/56026151
https://stackoverflow.com/a/52254083
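For a single call that works in both environments, here is a sketch that assumes TextEncoder is available as a global (browsers and recent Node versions). utf8ByteLength is only an illustrative name:

```javascript
// Hypothetical helper: UTF-8 byte length of a string.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

for (const sample of ["a", "ф", "€", "😃"]) {
  // sample.length counts UTF-16 code units, not bytes.
  console.log(sample, sample.length, utf8ByteLength(sample));
}
```

The four samples take 1, 2, 3 and 4 bytes respectively, while length reports 1, 1, 1 and 2.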
My searches for "how to split strings into chunks of a certain size" return results that relate to splitting a string into strings of a particular character length, not byte length, e.g.:
var my_string = "1234 5 678905";
console.log(my_string.match(/.{1,2}/g));
// ["12", "34", " 5", " 6", "78", "90", "5"]
Sources:
https://stackoverflow.com/a/7033662
https://stackoverflow.com/a/6259543
https://gist.github.com/hendriklammers/5231994
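To illustrate why the character-based results do not solve the problem, here is a small check on a made-up sample string. Each piece has at most 2 characters, yet its byte size is three times its character count:

```javascript
const encoder = new TextEncoder();

// At most 2 characters per piece, as in the regex approach above.
const pieces = "€€€€€".match(/.{1,2}/g);

for (const piece of pieces) {
  // Print the piece, its character count, and its UTF-8 byte size.
  console.log(piece, piece.length, encoder.encode(piece).length);
}
```

The pieces come out as "€€", "€€" and "€", which occupy 6, 6 and 3 bytes.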
Question
Is there a way to split a string into strings of a particular byte length?
I could either:
but would prefer a more accurate solution.
I would be interested to know of both Node and plain JavaScript solutions, if they exist.
EDIT
This approach to calculating byteLength might be helpful: it iterates over the characters in a string, gets each character code, and increments byteLength accordingly:
function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i = str.length - 1; i >= 0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s += 2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; // trail surrogate
  }
  return s;
}
Source: https://stackoverflow.com/a/23329386
which led me to interesting experiments into the underlying data structures of Buffer:
var buf = Buffer.from('Hey! ф');
// <Buffer 48 65 79 21 20 d1 84>
buf.length // 7
buf.toString().charCodeAt(0) // 72
buf.toString().charCodeAt(5) // 1092
buf.toString().charCodeAt(6) // NaN
buf[0] // 72
for (let i = 0; i < buf.length; i++) {
console.log(buf[i]);
}
// 72 101 121 33 32 209 132 undefined
buf.slice(0,5).toString() // 'Hey! '
buf.slice(0,6).toString() // 'Hey! �'
buf.slice(0,7).toString() // 'Hey! ф'
but as @trincot pointed out in the comments, what is the correct way to handle multibyte characters? And how could I ensure chunks were split on spaces (so as not to 'break apart' a word)?
More info on Buffer: https://nodejs.org/api/buffer.html#buffer_buffer
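On the multibyte question, one property of UTF-8 that may help: every byte after the first in a multi-byte sequence has the bit pattern 10xxxxxx. A cut index can therefore be moved backwards until it no longer points at such a continuation byte. The sketch below uses a hypothetical helper, safeCut, and only respects code point boundaries (not combined emoji or accents):

```javascript
// Hypothetical helper: largest index <= limit that does not fall
// inside a multi-byte UTF-8 sequence.
function safeCut(bytes, limit) {
  let cut = Math.min(limit, bytes.length);
  // Continuation bytes look like 10xxxxxx; step back past them.
  while (cut > 0 && cut < bytes.length && (bytes[cut] & 0xc0) === 0x80) {
    cut--;
  }
  return cut;
}

const bytes = new TextEncoder().encode("Hey! ф"); // 48 65 79 21 20 d1 84
const cut = safeCut(bytes, 6); // index 6 is the second byte of 'ф'
console.log(cut, JSON.stringify(new TextDecoder().decode(bytes.slice(0, cut))));
// 5 "Hey! "
```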
EDIT
In case it helps anyone else understand the brilliant logic in the accepted answer, the snippet below is a heavily commented version I made so I could understand it better.
/**
 * Takes a string and returns an array of substrings that are no larger than maxBytes.
 *
 * This is an overly commented version of the non-generator version of the accepted answer,
 * in case it helps anyone understand its (brilliant) logic.
 *
 * Both plain js and node variations are shown below - simply un/comment out your preference
 *
 * @param {string} s - the string to be chunked
 * @param {number} maxBytes - the maximum size of a chunk, in bytes
 * @return {string[]} - an array of strings of at most maxBytes bytes (except in extreme edge cases)
 */
function chunk(s, maxBytes) {
  // for plain js
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder().encode(s);
  // for node
  // let buf = Buffer.from(s);
  const result = [];
  var counter = 0;
  while (buf.length) {
    console.log("=============== BEG LOOP " + counter + " ===============");
    console.log("result is now:");
    console.log(result);
    console.log("buf is now:");
    // for plain js
    console.log(decoder.decode(buf));
    // for node
    // console.log(buf.toString());

    /* get index of the last space character in the first chunk,
       searching backwards from the maxBytes index (a space at index i
       gives a chunk of exactly i bytes) */
    let i = buf.lastIndexOf(32, maxBytes);
    console.log("i is: " + i);

    /* if no space is found in the first chunk, get index of the first space character
       after it, searching forwards from maxBytes - in edge cases where the characters
       between spaces exceed maxBytes, eg chunk("123456789x 1", 9),
       the chunk will exceed maxBytes */
    if (i < 0) i = buf.indexOf(32, maxBytes);
    console.log("at first condition, i is: " + i);

    /* if there's no space at all, take the whole string,
       again an edge case like chunk("123456789x", 9) will exceed maxBytes */
    if (i < 0) i = buf.length;
    console.log("at second condition, i is: " + i);

    // this is a safe cut-off point; never half-way a multi-byte
    // because the index is always the index of a space
    console.log("pushing buf.slice from 0 to " + i + " into result array");
    // for plain js
    result.push(decoder.decode(buf.slice(0, i)));
    // for node
    // result.push(buf.slice(0, i).toString());

    console.log("buf.slicing with value: " + (i + 1));
    // slice the string from the index + 1 forwards
    // it won't erroneously slice out a value after i, because i is a space
    buf = buf.slice(i + 1); // skip space (if any)
    console.log("=============== END LOOP " + counter + " ===============");
    counter++;
  }
  return result;
}
console.log(chunk("Hey there! € 100 to pay", 12));
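For the edge cases mentioned in the comments, where a run of characters without spaces is longer than maxBytes, a stricter variant is possible. This is only a sketch, not part of the accepted answer: chunkStrict is a made-up name, and it assumes maxBytes is at least 4 (the longest UTF-8 sequence) so that every iteration makes progress. It prefers to split on a space and otherwise hard-splits on a character boundary:

```javascript
// Sketch of a strict variant: no chunk exceeds maxBytes (assumes maxBytes >= 4).
function chunkStrict(s, maxBytes) {
  const decoder = new TextDecoder();
  let buf = new TextEncoder().encode(s);
  const result = [];
  while (buf.length) {
    if (buf.length <= maxBytes) {
      // The remainder fits as a whole.
      result.push(decoder.decode(buf));
      break;
    }
    // A space at index i gives a chunk of exactly i bytes.
    let i = buf.lastIndexOf(32, maxBytes);
    let skip = 1; // drop the space that was split on
    if (i <= 0) {
      // No usable space: hard-split, backing off past UTF-8
      // continuation bytes (10xxxxxx) to a character boundary.
      i = maxBytes;
      while (i > 0 && (buf[i] & 0xc0) === 0x80) i--;
      skip = 0;
    }
    result.push(decoder.decode(buf.slice(0, i)));
    buf = buf.slice(i + skip);
  }
  return result;
}

console.log(chunkStrict("123456789x 1", 9));
// [ '123456789', 'x 1' ]
```

The trade-off is that a long word can now be broken apart, which the original deliberately avoids.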
Using Buffer seems indeed the right direction. Given that:
Buffer prototype has indexOf and lastIndexOf methods, and
... you can proceed as follows:
function chunk(s, maxBytes) {
  let buf = Buffer.from(s);
  const result = [];
  while (buf.length) {
    // A space at index i yields a chunk of i bytes, so search back from maxBytes
    let i = buf.lastIndexOf(32, maxBytes);
    // If no space found, try forward search
    if (i < 0) i = buf.indexOf(32, maxBytes);
    // If there's no space at all, take the whole string
    if (i < 0) i = buf.length;
    // This is a safe cut-off point; never half-way a multi-byte
    result.push(buf.slice(0, i).toString());
    buf = buf.slice(i + 1); // Skip space (if any)
  }
  return result;
}
console.log(chunk("Hey there! € 100 to pay", 12));
// -> [ 'Hey there!', '€ 100 to', 'pay' ]
You can consider extending this to also look for TAB, LF, or CR as split-characters. If so, and your input text can have CRLF sequences, you would need to detect those as well to avoid getting orphaned CR or LF characters in the chunks.
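One way that extension could look, as a rough sketch rather than a tested implementation: chunkOnWhitespace and SEPARATORS are names invented here, and the code uses TextEncoder/TextDecoder so it runs in both Node and browsers. It treats the whole run of separator bytes around the cut as the split point, so a CRLF pair is never left half behind:

```javascript
// Byte values treated as split points: TAB, LF, CR, space.
const SEPARATORS = new Set([9, 10, 13, 32]);

function chunkOnWhitespace(s, maxBytes) {
  const decoder = new TextDecoder();
  let buf = new TextEncoder().encode(s);
  const result = [];
  while (buf.length) {
    let end = buf.length;
    if (buf.length > maxBytes) {
      // Search backwards from maxBytes for any separator byte.
      end = maxBytes;
      while (end >= 0 && !SEPARATORS.has(buf[end])) end--;
      if (end < 0) {
        // None found: use the first separator after maxBytes, or take everything.
        end = maxBytes + 1;
        while (end < buf.length && !SEPARATORS.has(buf[end])) end++;
      }
    }
    // Widen the cut to the whole run of separators around it.
    let start = end;
    while (start > 0 && SEPARATORS.has(buf[start - 1])) start--;
    while (end < buf.length && SEPARATORS.has(buf[end])) end++;
    // Skip empty chunks (input that starts with separators).
    if (start > 0) result.push(decoder.decode(buf.slice(0, start)));
    buf = buf.slice(end);
  }
  return result;
}

console.log(chunkOnWhitespace("Hey there!\r\n€ 100\tto pay", 12));
```

Separators inside a chunk are kept as they are; only the run at the cut is dropped.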
You can turn the chunk function into a generator, so that you control when the processing for the next chunk starts:
function * chunk(s, maxBytes) {
  let buf = Buffer.from(s);
  while (buf.length) {
    // A space at index i yields a chunk of i bytes, so search back from maxBytes
    let i = buf.lastIndexOf(32, maxBytes);
    // If no space found, try forward search
    if (i < 0) i = buf.indexOf(32, maxBytes);
    // If there's no space at all, take all
    if (i < 0) i = buf.length;
    // This is a safe cut-off point; never half-way a multi-byte
    yield buf.slice(0, i).toString();
    buf = buf.slice(i + 1); // Skip space (if any)
  }
}
for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
Buffer is specific to Node. Browsers however implement TextEncoder and TextDecoder, which leads to similar code:
function * chunk(s, maxBytes) {
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder().encode(s);
  while (buf.length) {
    // A space at index i yields a chunk of i bytes, so search back from maxBytes
    let i = buf.lastIndexOf(32, maxBytes);
    // If no space found, try forward search
    if (i < 0) i = buf.indexOf(32, maxBytes);
    // If there's no space at all, take all
    if (i < 0) i = buf.length;
    // This is a safe cut-off point; never half-way a multi-byte
    yield decoder.decode(buf.slice(0, i));
    buf = buf.slice(i + 1); // Skip space (if any)
  }
}
for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
A possible solution is to count the bytes of every character:
function charByteCounter(char) {
  // UTF-8 byte length of a single character, based on its code point
  const code = char.codePointAt(0)
  if (code < 0x80) return 1
  if (code < 0x800) return 2
  if (code < 0x10000) return 3
  return 4
}
function * chunk(string, maxBytes) {
  let byteCounter = 0
  let buildString = ''
  for (const char of string) {
    const bytes = charByteCounter(char)
    if (byteCounter + bytes > maxBytes) { // this char would push the chunk past maxBytes
      yield buildString // string with a byte count less than or equal to maxBytes
      buildString = char
      byteCounter = bytes
      continue
    }
    buildString += char
    byteCounter += bytes
  }
  yield buildString
}
for (const s of chunk("Hey! 😃, nice to meet you!", 12))
  console.log(s);
Small addition to @trincot's answer:
If the string you are splitting contains a space (" "), then the returned array is always split into at least 2 items, even when the full string would fit into maxBytes (so it should return only 1 item).
To fix this I added a check in the first line of the while loop:
export function chunkText (text: string, maxBytes: number): string[] {
  let buf = Buffer.from(text)
  const result: string[] = []
  while (buf.length) {
    // Only look for a split point when the remainder does not fit as a whole
    let i = buf.length > maxBytes ? buf.lastIndexOf(32, maxBytes) : buf.length
    // If no space found, try forward search
    if (i < 0) i = buf.indexOf(32, maxBytes)
    // If there's no space at all, take the whole string
    if (i < 0) i = buf.length
    // This is a safe cut-off point; never half-way a multi-byte
    result.push(buf.slice(0, i).toString())
    buf = buf.slice(i + 1) // Skip space (if any)
  }
  return result
}