
Truncate unicode string to max bytes

I need to truncate a (possibly large) unicode string to a max size in bytes. Converting to UTF-16 and then back appears unreliable.

For example:

let flags = "🇵🇷🇵🇷"
let result = String(flags.utf16.prefix(3))

In this case result is nil.

I need an efficient way to perform this truncation. Ideas?

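A Swift String is built from Unicode scalars, and each scalar can take several bytes to store (up to 4 in UTF-8). If you just take the first n bytes no matter what, chances are the cut will land in the middle of a scalar, and those bytes will not form a valid string in any encoding when you convert them back. That is what happens in your example: 3 UTF-16 code units end halfway through a surrogate pair.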
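To make the sizes concrete, here is a small sketch that counts how a single flag is stored in each view. The constant names are only illustrative, and `flag.count` assumes Swift 4 or later.

```swift
// The Puerto Rico flag, written with scalar escapes: one visible character
// made of two regional-indicator scalars (U+1F1F5 and U+1F1F7).
let flag = "\u{1F1F5}\u{1F1F7}"

let characterCount = flag.count               // user-perceived characters
let scalarCount = flag.unicodeScalars.count   // Unicode scalars
let utf16Count = flag.utf16.count             // UTF-16 code units
let utf8Count = flag.utf8.count               // UTF-8 bytes

print(characterCount, scalarCount, utf16Count, utf8Count)  // prints "1 2 4 8"
```

So a byte limit can fall inside a scalar, or between the two scalars of one flag.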

Now if you change the definition to "take the longest prefix of at most n bytes that forms a valid string", you can use the string's UTF8View:

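extension String {
    // Returns the longest UTF-8 prefix of at most `count` bytes that still
    // converts back to a String. The return type is spelled as SubSequence so
    // it matches whatever `utf8.prefix(_:)` returns in your Swift version.
    func firstBytes(_ count: Int) -> String.UTF8View.SubSequence {
        guard count > 0 else { return self.utf8.prefix(0) }

        var actualByteCount = count
        while actualByteCount > 0 {
            let subview = self.utf8.prefix(actualByteCount)
            // String(subview) is nil when the cut falls inside a scalar,
            // so back off one byte at a time until it succeeds
            if String(subview) != nil {
                return subview
            }
            actualByteCount -= 1
        }

        return self.utf8.prefix(0)
    }
}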

let flags = "welcome to 🇵🇷 and 🇺🇸"

let bytes1 = flags.firstBytes(11)

// the Puerto Rico flag takes 8 bytes to store in UTF-8, so 13 bytes
// cannot include it; the result is 11 bytes long, same as bytes1
let bytes2 = flags.firstBytes(13)

// 19 bytes is enough to cover the string up to and including the Puerto Rico flag
let bytes3 = flags.firstBytes(19)

print("'\(bytes1)'")
print("'\(bytes2)'")
print("'\(bytes3)'")
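One caveat: as far as I can tell, the String(subview) check only rejects cuts inside a scalar, so a limit of 15 bytes could keep the first half of the flag (one regional indicator) and drop the second. If you want to keep whole visible characters instead, a variant like the sketch below walks Character boundaries. The method name `truncated(toMaxUTF8Bytes:)` is hypothetical, not a standard library API, and the code assumes Swift 4 or later, where String indices can be passed to the utf8 view.

```swift
extension String {
    // Hypothetical helper: keeps whole Characters for as long as their
    // combined UTF-8 size fits in maxBytes.
    func truncated(toMaxUTF8Bytes maxBytes: Int) -> Substring {
        var usedBytes = 0
        var end = startIndex
        while end < endIndex {
            let next = index(after: end)
            // UTF-8 size of the Character that spans end..<next
            let characterBytes = utf8.distance(from: end, to: next)
            if usedBytes + characterBytes > maxBytes { break }
            usedBytes += characterBytes
            end = next
        }
        return self[..<end]
    }
}

// Same sentence as above, with the two flags written as scalar escapes
let text = "welcome to \u{1F1F5}\u{1F1F7} and \u{1F1FA}\u{1F1F8}"
let cut15 = text.truncated(toMaxUTF8Bytes: 15)  // the flag does not fit
let cut19 = text.truncated(toMaxUTF8Bytes: 19)  // the flag fits exactly
print(cut15.utf8.count, cut19.utf8.count)       // prints "11 19"
```

This stops as soon as the limit is reached, so the work done depends on the byte limit rather than on the full length of the string.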
