
Truncate a string to a 1 MB size limit

I need to cut a string: if the string is longer than 1 MB, I should cut it down to that size.

I am using these functions to check the string size:

function __to_mb(bytes) {
   return bytes / Math.pow(1024, 2)
}

function __size_mb(str) {
  return __to_mb(Buffer.byteLength(str, 'utf8'))
}

Then I check the size of the string like this:

if (__size_mb(str) > 1) { /* do something */ }

But how do I cut it?

A JavaScript string consists of 16-bit sequences (UTF-16 code units), with some characters using one 16-bit sequence and others needing two 16-bit sequences.

There is no easy way to just take a number of bytes and consider it done: a 2x 16-bit character might straddle the cut-off location, and it would then be cut in half.
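To make the problem concrete, here is a minimal sketch of what "cut in half" looks like. The variable names are only illustrative and are not part of the original answer.

```javascript
// "😂" (U+1F602) is one character but occupies two 16-bit code units.
const emoji = "😂";
console.log(emoji.length); // 2

// Cutting after the first code unit leaves an unpaired high surrogate,
// which is no longer a valid character on its own.
const half = emoji.substring(0, 1);
console.log(half.length); // 1
console.log(half.charCodeAt(0).toString(16)); // d83d
console.log(half === emoji); // false
```

A naive str.substring(0, n) therefore risks ending the result with such a half character whenever position n falls inside a pair.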

To make a safe cut, we can use str.codePointAt(index), which was introduced in ES2015. It knows which characters are 16-bit and which are 2x 16-bit. It combines either 1 or 2 of these 16-bit values into an integer result value.

  • If codePointAt() returns a value <= 2^16-1, then we have a 16-bit character at offset index.
  • If codePointAt() returns a value >= 2^16, then we have a 2x 16-bit character at offsets index and index+1.
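The two cases above can be sketched as follows. The helper name unitsAt and the sample string are assumptions made for illustration only.

```javascript
// Hypothetical helper: how many 16-bit positions does the character
// starting at `index` occupy?
function unitsAt(str, index) {
  return str.codePointAt(index) <= 0xFFFF ? 1 : 2;
}

const sample = "a😂b";
console.log(unitsAt(sample, 0)); // 1  ("a" is a 16-bit character)
console.log(unitsAt(sample, 1)); // 2  ("😂" is a 2x 16-bit character)
console.log(sample.codePointAt(1).toString(16)); // 1f602

// Index 2 points at the second half of "😂"; codePointAt() then returns
// only that 16-bit value, so a scan must skip it by advancing 2 positions.
console.log(sample.codePointAt(2).toString(16)); // de02
```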

Unfortunately, this means going through the string to assess each index. This may seem awkward, and it may even be slow, but I am not aware of a faster or smarter way of doing this.

Demo:

var str = "abç🔥😂déΩf👍g😏h"; // string of 13 characters
console.log("str.length = " + str.length); // shows 17 because of double-width chars
console.log("size in bytes = " + str.length * 2); // length * 2 gives the size in bytes (as UTF-16)

var maxByteLengths = [8, 16, 24, 32, 40];
for (var maxBytes of maxByteLengths) {
  var data = safeCutOff(str, maxBytes);
  console.log(maxBytes + " bytes -> " + data.text + " (" + data.bytes + " bytes)");
}

function safeCutOff(str, maxBytes) {
  let widthInBytes = 0;
  for (var index = 0; index < str.length; /* index is incremented below */ ) {
    // 1 position for a 16-bit character, 2 positions for a 2x 16-bit character
    let positionsUsed = str.codePointAt(index) <= 0xFFFF ? 1 : 2;
    // declared with let; the original assigned to an undeclared (global) variable
    let newWidthInBytes = widthInBytes + 2 * positionsUsed;
    if (newWidthInBytes > maxBytes) break; // the next character would not fit
    index += positionsUsed;
    widthInBytes = newWidthInBytes;
  }
  return { text: str.substring(0, index), bytes: widthInBytes };
}
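Note that the demo counts 2 bytes per 16-bit position, whereas the question measures size with Buffer.byteLength(str, 'utf8'). If the 1 MB limit is meant in UTF-8 bytes, the same scan could be adapted roughly as below. This is only a sketch under that assumption; truncateUtf8 and ONE_MB are hypothetical names, and Buffer requires Node.js.

```javascript
// Hypothetical helper: truncate `str` so that its UTF-8 encoding
// fits within `maxBytes`, never splitting a character.
function truncateUtf8(str, maxBytes) {
  let bytes = 0;
  let index = 0;
  while (index < str.length) {
    const cp = str.codePointAt(index);
    // UTF-8 needs 1, 2, 3 or 4 bytes depending on the code point value.
    const width = cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
    if (bytes + width > maxBytes) break; // the next character would not fit
    bytes += width;
    index += cp <= 0xFFFF ? 1 : 2; // skip both halves of a 2x 16-bit character
  }
  return { text: str.substring(0, index), bytes: bytes };
}

const ONE_MB = 1024 * 1024;
const big = "\u00e9".repeat(ONE_MB); // "é" takes 2 bytes in UTF-8, so about 2 MB
const cut = truncateUtf8(big, ONE_MB);
console.log(cut.bytes); // 1048576
console.log(Buffer.byteLength(cut.text, "utf8") <= ONE_MB); // true
```

The loop stops as soon as the limit is reached, so only the part of the string that is kept gets scanned.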

