
Java/Kotlin Normalizer fails to normalize some accented letters

I noticed that java.text.Normalizer leaves some non-ASCII letters unchanged, such as the first letter in the name of the Polish city Łódź. The following program lists more of them:

import java.text.Normalizer

fun main() {
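    // Scan the Latin-1 Supplement and Latin Extended-A letters.
    for (i in 0xC0..0x170) {
        val ch = Char(i)
        if (!ch.isLetter()) continue
        val norm = Normalizer.normalize(ch.toString(), Normalizer.Form.NFD)
        if (norm.length >= 2) {
            // Decomposed into a base letter plus combining mark(s); not reported.
            // println("'$ch' => '${norm[0]}' ${norm[0].code} '${norm[1]}' ${norm[1].code}")
        } else {
            // NFD left the letter unchanged; report it.
            println("'$ch' => '${norm[0]}' ${norm[0].code}")
        }
    }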
}

This prints:

'Æ' => 'Æ' 198
'Ð' => 'Ð' 208
'Ø' => 'Ø' 216
...
'Ĳ' => 'Ĳ' 306
'ĳ' => 'ĳ' 307
'ĸ' => 'ĸ' 312
'Ŀ' => 'Ŀ' 319
'ŀ' => 'ŀ' 320
'Ł' => 'Ł' 321
'ł' => 'ł' 322
'ŉ' => 'ŉ' 329
'Ŋ' => 'Ŋ' 330
'ŋ' => 'ŋ' 331
'Œ' => 'Œ' 338
'œ' => 'œ' 339
'Ŧ' => 'Ŧ' 358
'ŧ' => 'ŧ' 359

To me, this somewhat defeats the purpose of the Normalizer: I assumed I could use it to get an ASCII equivalent for every character in the isLetter set.

Does anyone know whether this is considered a bug? If not, is there another method that would map 'Ł' to 'L', 'Æ' to 'AE', etc.?

Here's some code using Collator. This, too, doesn't know about the Polish L! So I accept @VGR's explanation that some data must just be missing.

import java.text.Collator

fun main() {
    val s = "ŁóźÆæŒĸ"
    val sx = listOf("L","o","z","AE", "ae", "OE", "k")
    val c1 = Collator.getInstance()
    // PRIMARY strength compares base letters only, ignoring accents and case.
    c1.strength = Collator.PRIMARY
    for ((i, ch) in s.withIndex()) {
        val cmp1 = c1.compare(ch.toString(), sx[i])
        println("'$ch' '${sx[i]}' -> $cmp1")
    }
}

Results:

'Ł' 'L' -> 1
'ó' 'o' -> 0
'ź' 'z' -> 0
'Æ' 'AE' -> 0
'æ' 'ae' -> 0
'Œ' 'OE' -> 0
'ĸ' 'k' -> 1
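
For what it's worth, one possible workaround can be sketched as follows: decompose with NFD, strip the combining marks, and then fall back to a hand-written table for the letters that NFD leaves unchanged. The names toAscii, asciiFallback and combiningMarks are hypothetical, and the table is my own guess at reasonable ASCII stand-ins, not something the JDK provides; it is also incomplete.

```kotlin
import java.text.Normalizer

// Hypothetical, hand-written fallback table (an assumption, not JDK data):
// ASCII stand-ins for some letters that NFD does not decompose.
val asciiFallback = mapOf(
    'Æ' to "AE", 'æ' to "ae",
    'Ð' to "D", 'ð' to "d",
    'Ø' to "O", 'ø' to "o",
    'ĸ' to "k",
    'Ł' to "L", 'ł' to "l",
    'Ŋ' to "N", 'ŋ' to "n",
    'Œ' to "OE", 'œ' to "oe",
    'Ŧ' to "T", 'ŧ' to "t"
)

// Matches the combining marks that NFD splits off from a base letter.
val combiningMarks = Regex("\\p{M}+")

fun toAscii(input: String): String {
    // Step 1: canonical decomposition, then drop the combining marks ('ó' becomes 'o').
    val stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
        .replace(combiningMarks, "")
    // Step 2: hand-map the remaining letters; characters not in the table pass through as-is.
    return buildString {
        for (ch in stripped) append(asciiFallback[ch] ?: ch.toString())
    }
}

fun main() {
    println(toAscii("Łódź")) // prints "Lodz"
    println(toAscii("Æsop")) // prints "AEsop"
}
```

Any letter missing from the table is returned unchanged, so the result is only guaranteed to be ASCII for input the table covers.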
