
Unicode normalization: dotless i + accent

Let's combine a regular i with a combining acute accent and normalize the result (using Python's unicodedata.normalize):

>>> from unicodedata import normalize
>>> normalize("NFC", "i\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")
b'\\N{LATIN SMALL LETTER I WITH ACUTE}'

As expected: a small i with the dot swapped out for an acute accent, í.

Let's do the same with a dotless i:

>>> from unicodedata import normalize
>>> normalize("NFC", "\N{LATIN SMALL LETTER DOTLESS I}\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")
b'\\N{LATIN SMALL LETTER DOTLESS I}\\N{COMBINING ACUTE ACCENT}'

As you can see, the two code points do not combine. Other implementations (e.g., this one) do the same.
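To see where this behaviour comes from, here is a rough sketch that inspects the character database shipped with Python. It looks up the canonical decomposition of í and then scans every code point for one whose canonical decomposition starts with the dotless i (U+0131). The variable names are my own, and the result depends on the Unicode version bundled with your Python; on the versions I am aware of, the scan finds nothing, so NFC has nothing to compose the pair into.

```python
import sys
import unicodedata

# Canonical decomposition of the precomposed character: the base letter
# "i" (U+0069) followed by the combining acute accent (U+0301).
i_acute_decomposition = unicodedata.decomposition(
    "\N{LATIN SMALL LETTER I WITH ACUTE}"
)

# Collect every code point whose canonical decomposition starts with the
# dotless i (U+0131). Compatibility decompositions carry a "<tag>" prefix
# and are skipped, because NFC only composes canonical pairs.
composites_on_dotless_i = []
for code_point in range(sys.maxunicode + 1):
    decomposition = unicodedata.decomposition(chr(code_point))
    if decomposition and not decomposition.startswith("<"):
        if decomposition.split()[0] == "0131":
            composites_on_dotless_i.append(chr(code_point))

print(i_acute_decomposition)
print(composites_on_dotless_i)
```

The first line printed is the decomposition of í, which names U+0069 (the dotted i) as its base. The second is the list of precomposed characters built on U+0131.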

Why not? Is this consistent with the Unicode standard?

From The Unicode Standard, Version 14.0, "Diacritics on i and j" (emphasis mine):

A dotted (normal) i or j followed by some common nonspacing marks above loses the dot in rendering. Thus, in the word naïve, the ï could be spelled with i + diaeresis. A dotted-i is not equivalent to a Turkish dotless-i + overdot, nor are other cases of accented dotted-i equivalent to accented dotless-i (for example, i + ¨ ≠ ı + ¨).
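The non-equivalence in that passage can be checked directly. This is a small sketch (variable names are mine) comparing i + diaeresis with ı + diaeresis under all four normalization forms. If the two sequences were canonically or compatibility equivalent, at least one form would map them to the same string.

```python
from unicodedata import normalize

DIAERESIS = "\N{COMBINING DIAERESIS}"
dotted = "i" + DIAERESIS
dotless = "\N{LATIN SMALL LETTER DOTLESS I}" + DIAERESIS

# Compare the two sequences under every normalization form.
equal_under_form = {
    form: normalize(form, dotted) == normalize(form, dotless)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(equal_under_form)

# The dotted sequence composes to a single code point under NFC;
# the dotless one stays as two code points.
composed_dotted = normalize("NFC", dotted)
composed_dotless = normalize("NFC", dotless)
print(len(composed_dotted), len(composed_dotless))
```

No form makes the two strings compare equal, which matches the standard's statement that accented dotted-i and accented dotless-i are distinct.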
