I noticed that the Normalizer leaves some non-ASCII letters alone, such as the first letter in the name of the Polish city Łódź. Here are some more:
import java.text.Normalizer

fun main() {
    for (i in 0xC0..0x170) {
        val ch = Char(i)
        if (!ch.isLetter()) continue
        val norm = Normalizer.normalize(ch.toString(), Normalizer.Form.NFD)
        if (norm.length >= 2) {
            // Decomposed into base letter + combining mark; not printed here.
            // println("'$ch' => '${norm[0]}' ${norm[0].code} '${norm[1]}' ${norm[1].code}")
        } else {
            // NFD left the character unchanged.
            println("'$ch' => '${norm[0]}' ${norm[0].code}")
        }
    }
}
This prints:
'Æ' => 'Æ' 198
'Ð' => 'Ð' 208
'Ø' => 'Ø' 216
...
'Ĳ' => 'Ĳ' 306
'ĳ' => 'ĳ' 307
'ĸ' => 'ĸ' 312
'Ŀ' => 'Ŀ' 319
'ŀ' => 'ŀ' 320
'Ł' => 'Ł' 321
'ł' => 'ł' 322
'ŉ' => 'ŉ' 329
'Ŋ' => 'Ŋ' 330
'ŋ' => 'ŋ' 331
'Œ' => 'Œ' 338
'œ' => 'œ' 339
'Ŧ' => 'Ŧ' 358
'ŧ' => 'ŧ' 359
To me, this somewhat defeats the purpose of the Normalizer -- I assumed I could use it to get an ASCII equivalent for every character in the isLetter set.
Does anyone know whether this is considered a bug? If not, is there another method that would map 'Ł' to 'L', 'Æ' to 'AE', etc?
Here's some code using Collator. This, too, doesn't know about the Polish L! So I accept @VGR's explanation that some data must just be missing.
import java.text.Collator

fun main() {
    val s = "ŁóźÆæŒĸ"
    val sx = listOf("L", "o", "z", "AE", "ae", "OE", "k")
    val c1 = Collator.getInstance()
    // PRIMARY strength ignores accent and case differences.
    c1.strength = Collator.PRIMARY
    for ((i, ch) in s.withIndex()) {
        val cmp1 = c1.compare(ch.toString(), sx[i])
        println("'$ch' '${sx[i]}' -> $cmp1")
    }
}
Results:
'Ł' 'L' -> 1
'ó' 'o' -> 0
'ź' 'z' -> 0
'Æ' 'AE' -> 0
'æ' 'ae' -> 0
'Œ' 'OE' -> 0
'ĸ' 'k' -> 1
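For what it's worth, NFD only splits characters that have a canonical decomposition in the Unicode data, and letters like 'Ł' and 'Æ' have none, so the Normalizer has nothing to split them into. One possible workaround, as a rough sketch only: run NFD, strip the combining marks, then look up whatever is left in a hand-written table. The names asciiFallbacks and toAsciiSketch are made up for this example, and the table is deliberately incomplete; which ASCII spelling counts as "equivalent" is my own assumption, not something the JDK defines.

```kotlin
import java.text.Normalizer

// Hypothetical, hand-maintained table for letters that NFD leaves untouched.
// Incomplete on purpose; the chosen ASCII spellings are an assumption.
val asciiFallbacks: Map<Char, String> = mapOf(
    'Ł' to "L", 'ł' to "l",
    'Æ' to "AE", 'æ' to "ae",
    'Œ' to "OE", 'œ' to "oe",
    'Ø' to "O", 'ø' to "o",
    'Đ' to "D", 'đ' to "d",
    'Ŧ' to "T", 'ŧ' to "t"
)

// Matches runs of non-spacing combining marks (accents produced by NFD).
private val combiningMarks = Regex("\\p{Mn}+")

fun toAsciiSketch(input: String): String {
    // Step 1: canonical decomposition, e.g. 'ó' becomes 'o' + U+0301.
    val decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
    // Step 2: drop the combining marks, keeping only the base letters.
    val stripped = combiningMarks.replace(decomposed, "")
    // Step 3: replace remaining letters via the fallback table;
    // anything not in the table is passed through unchanged.
    return buildString {
        for (ch in stripped) append(asciiFallbacks[ch] ?: ch.toString())
    }
}
```

With this table, toAsciiSketch("Łódź") gives "Lodz". Characters missing from the table, such as 'ĸ', come back unchanged, so the table would need to be extended for whatever alphabet you care about.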