Efficient way to check if unicode string is NFC in Python?

I want to check if a string is already in NFC form. Currently I do:

unicodedata.normalize('NFC', s) == s

I am doing this for a large number of strings, so I would like to be efficient. The above method seems wasteful. It converts to NFC, and then does a string comparison.

Is there a more efficient way to do it? I have considered:

len(unicodedata.normalize('NFC', s)) == len(s)

This avoids the string comparison, but I am not sure it is always correct. It works only if NFC normalization always changes the length of a non-NFC string. Is that a valid assumption?

Any other ideas?

Normalising doesn't necessarily change the length of a string. For example, 'Ω' (U+2126) becomes 'Ω' (U+03A9) after NFC.
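The counterexample above can be checked directly. This is a minimal sketch using only the standard unicodedata module; the variable names are just for illustration:

```python
import unicodedata

ohm = "\u2126"                              # OHM SIGN
omega = unicodedata.normalize("NFC", ohm)   # GREEK CAPITAL LETTER OMEGA

print(len(ohm) == len(omega))   # True: the length check would wrongly pass
print(omega == ohm)             # False: the string was not in NFC
print(f"U+{ord(omega):04X}")    # U+03A9
```

So comparing lengths can report a string as NFC when it is not.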

There is a normalisation "quick check" property in the Unicode database to test whether a character is already normalised, but unfortunately Python's unicodedata module did not expose it before Python 3.8. However, unicodedata.normalize() does use this property to avoid doing any extra work if the string is already normalised: it simply returns the input string.

To access this property, you will either need to compile a table yourself from the Unicode Character Database, or use a broader Unicode library with Python bindings (like PyICU).

Since Python 3.8, the unicodedata module exposes the needed check. Quote from the Python docs:

unicodedata.is_normalized(form, unistr)

Return whether the Unicode string unistr is in the normal form 'form'. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

New in version 3.8.
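As a rough sketch of how this could be used for the original question: the helper names is_nfc and ensure_nfc below are hypothetical, and the fallback branch for interpreters older than 3.8 is an assumption about what you might want there, not something the docs prescribe.

```python
import unicodedata

def is_nfc(s):
    """Hypothetical helper: return True if s is already in NFC."""
    if hasattr(unicodedata, "is_normalized"):
        # Python 3.8+: dedicated check
        return unicodedata.is_normalized("NFC", s)
    # Older Pythons: fall back to the approach from the question
    return unicodedata.normalize("NFC", s) == s

def ensure_nfc(s):
    """Hypothetical helper: normalise only when the check fails."""
    return s if is_nfc(s) else unicodedata.normalize("NFC", s)

print(is_nfc("caf\u00e9"))     # True: precomposed é
print(is_nfc("cafe\u0301"))    # False: e followed by a combining acute accent
```

Whether this is measurably faster than always calling normalize() depends on your data, so it is worth timing on a sample of your own strings.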

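I wanted everything to be in NFC, but checking for NFD (so I could convert only those strings) did not work: all NFC strings passed the NFD check! A string can be in both forms at once (plain ASCII text, for example), so passing the NFD check does not mean a string is not NFC. My solution was to test whether a string is not NFC and, if so, do the conversion.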
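A small sketch of why the NFD check is not a reliable filter, assuming Python 3.8 or later. The sample strings are made up for illustration:

```python
import unicodedata

samples = {
    "ascii": "hello",             # no composable characters
    "composed": "caf\u00e9",      # precomposed é
    "decomposed": "cafe\u0301",   # e + combining acute accent
}

# Map each sample name to (is NFC, is NFD)
results = {
    name: (unicodedata.is_normalized("NFC", s),
           unicodedata.is_normalized("NFD", s))
    for name, s in samples.items()
}

for name, (nfc, nfd) in results.items():
    print(f"{name}: NFC={nfc} NFD={nfd}")
# ascii: NFC=True NFD=True
# composed: NFC=True NFD=False
# decomposed: NFC=False NFD=True
```

The ASCII sample is in both forms at once, which is why testing for "not NFC" is the safer condition.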
