Convert between diacritic variants of a character

I'm passing a string as a parameter to a command-line tool written in Swift.

I have a problem with some characters containing diacritics.

If I pass à á ả ã ạ й ё as a command-line argument, inside the app I get à á ả ã ạ й ё . The two strings look the same, but they are not:

func printUnicodeScalars(_ string: String) {
    print(string, "->", string.unicodeScalars.map { $0 })
}
printUnicodeScalars("à á ả ã ạ й ё")
// à á ả ã ạ й ё -> ["\u{00E0}", " ", "\u{00E1}", " ", "\u{1EA3}", " ", "\u{00E3}", " ", "\u{1EA1}", " ", "\u{0439}", " ", "\u{0451}"]
printUnicodeScalars("à á ả ã ạ й ё")
// à á ả ã ạ й ё -> ["a", "\u{0300}", " ", "a", "\u{0301}", " ", "a", "\u{0309}", " ", "a", "\u{0303}", " ", "a", "\u{0323}", " ", "\u{0438}", "\u{0306}", " ", "\u{0435}", "\u{0308}"]

I know that a character with a diacritic can be represented in Unicode in different ways: as a single precomposed character, or as a combination of two scalars, a base letter followed by a combining diacritic.
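
To make the two representations concrete, here is a minimal sketch that spells the same letter both ways using explicit escapes. The variable names are illustrative only. It relies on the fact that Swift's String comparison is based on canonical equivalence, so the two spellings compare equal even though their scalars differ.

```swift
// Two spellings of "à", written with escapes so the scalars are unambiguous.
let precomposed = "\u{00E0}"   // LATIN SMALL LETTER A WITH GRAVE (one scalar)
let decomposed = "a\u{0300}"   // "a" followed by COMBINING GRAVE ACCENT (two scalars)

// String equality uses canonical equivalence, so the two compare equal.
let sameText = (precomposed == decomposed)

// The underlying scalar sequences differ in length.
let precomposedScalars = precomposed.unicodeScalars.count
let decomposedScalars = decomposed.unicodeScalars.count

// Both are a single user-perceived Character (grapheme cluster).
let decomposedCharacters = decomposed.count

print(sameText, precomposedScalars, decomposedScalars, decomposedCharacters)
```

This is why the strings print identically and compare equal, yet show different output when their unicodeScalars are inspected.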

For some reason, the first variant has been converted into the second by the time it reaches the command-line tool. Perhaps that is because it is limited to UTF-8.

How can I convert it back, that is, join the multiple Unicode scalars of a character into a single one?

I think you need to use precomposedStringWithCanonicalMapping (available on String once Foundation is imported). This converts the string to Unicode Normalization Form C (NFC), which is:

Canonical Decomposition, followed by Canonical Composition

Example:

import Foundation // provides precomposedStringWithCanonicalMapping
let string = "à á ả ã ạ й ё"
print(string.unicodeScalars.count) // 20
print(string.precomposedStringWithCanonicalMapping.unicodeScalars.count) // 13
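
As a slightly fuller sketch of the same idea, the snippet below writes the decomposed input with explicit escapes, so that the scalars survive copy and paste, and applies the conversion to the process arguments as well. The helper name nfc is hypothetical and not part of any API. Normalizing CommandLine.arguments this way is an assumption about how the tool might use it, not something stated in the question.

```swift
import Foundation  // provides precomposedStringWithCanonicalMapping

// Hypothetical helper: returns the NFC (precomposed) form of a string.
func nfc(_ s: String) -> String {
    return s.precomposedStringWithCanonicalMapping
}

// Decomposed spellings of "à й ё": base letter + combining mark each.
let decomposedInput = "a\u{0300} \u{0438}\u{0306} \u{0435}\u{0308}"
let composed = nfc(decomposedInput)

// Compare the raw scalar values before and after normalization.
let before = decomposedInput.unicodeScalars.map { $0.value }
let after = composed.unicodeScalars.map { $0.value }
print(before.count, "->", after.count)

// Sketch: normalize every argument passed to the tool (skipping the program name).
let normalizedArguments = CommandLine.arguments.dropFirst().map(nfc)
print(normalizedArguments)
```

Normalizing once at the point where arguments enter the program keeps the rest of the code free of assumptions about which form the shell or terminal delivered.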
