Unicode normalization: dotless i + accent

Question

Let's combine a regular i with a combining acute accent, and normalize the result (using Python's unicodedata.normalize):

from unicodedata import normalize

normalize("NFC", "i\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")

b'\\N{LATIN SMALL LETTER I WITH ACUTE}'

As expected: a small i with the dot swapped out for an acute accent, í.

Let's do the same with a dotless i:

from unicodedata import normalize

normalize("NFC", "\N{LATIN SMALL LETTER DOTLESS I}\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")

b'\\N{LATIN SMALL LETTER DOTLESS I}\\N{COMBINING ACUTE ACCENT}'

As you can see, the two code points do not compose into a single character. Other implementations, e.g. this one, do the same.

Why not? Is this consistent with the Unicode standard?

Answer 1

From The Unicode Standard, Version 14.0, Diacritics on i and j (emphasis mine):

A dotted (normal) i or j followed by some common nonspacing marks above loses the dot in rendering. Thus, in the word naïve, the ï could be spelled with i + diaeresis. A dotted-i is not equivalent to a Turkish dotless-i + overdot, nor are other cases of accented dotted-i equivalent to accented dotless-i (for example, i + ¨ ≠ ı + ¨).
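So the behaviour in the question is consistent with the standard: í canonically decomposes to a dotted i plus the combining acute, and the dotless i has no canonical decomposition of its own, so NFC has nothing to compose it into. A quick way to check this against the Unicode data bundled with your Python is sketched below (the variable names are just for illustration):

```python
from unicodedata import decomposition, name, normalize

composed = "\N{LATIN SMALL LETTER I WITH ACUTE}"

# Canonical decomposition mapping as recorded in the Unicode Character Database.
mapping = decomposition(composed)

# NFD applies that mapping: the base letter is the ordinary dotted i (U+0069).
names = [name(ch) for ch in normalize("NFD", composed)]

# The dotless i has no decomposition mapping at all ...
dotless = "\N{LATIN SMALL LETTER DOTLESS I}"
dotless_mapping = decomposition(dotless)

# ... so NFC leaves "dotless i + acute" as two separate code points.
dotless_acute = dotless + "\N{COMBINING ACUTE ACCENT}"
unchanged = normalize("NFC", dotless_acute) == dotless_acute

print(mapping)
print(names)
print(repr(dotless_mapping), unchanged)
```

Since composition under NFC is only the reverse of these canonical decomposition mappings, a sequence that no character decomposes to is left alone.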
