I'm looking to standardize some Unicode text in Python. Is there an easy way to get the "denormalized" (composed) form of a combining Unicode character sequence? E.g. if I have the sequence u'o\xaf' (i.e. LATIN SMALL LETTER O followed by a combining macron), how do I get ō (LATIN SMALL LETTER O WITH MACRON)? It's easy to go the other way:
import unicodedata

o = unicodedata.lookup("LATIN SMALL LETTER O WITH MACRON")
o = unicodedata.normalize('NFD', o)  # decompose: u'o\u0304'
o = unicodedata.normalize('NFC', o)  # recompose: u'\u014d'
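For reference, if the second character really is COMBINING MACRON (U+0304) rather than U+00AF, composing is just a single NFC call (a minimal sketch, Python 3):

```python
import unicodedata

# 'o' followed by COMBINING MACRON (U+0304) -- a true combining sequence
seq = u'o\u0304'
composed = unicodedata.normalize('NFC', seq)
print(composed)  # ō (U+014D)
```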
As I commented, U+00AF is not a combining macron; it is MACRON, a spacing character. But you can convert it into U+0020 U+0304 with an NFKD transform:
>>> unicodedata.normalize('NFKD', u'o\u00af')
u'o \u0304'
Then you could remove the space and get ō with NFC.
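Putting the whole round trip together (a sketch; stripping the space with `replace` assumes no other spaces in the string matter):

```python
import unicodedata

s = u'o\xaf'  # 'o' + MACRON (U+00AF, the spacing character)
# NFKD compatibility-decomposes U+00AF into SPACE + COMBINING MACRON
decomposed = unicodedata.normalize('NFKD', s)      # 'o \u0304'
# drop the stray space so the macron attaches to the 'o'
stripped = decomposed.replace(' ', '')             # 'o\u0304'
composed = unicodedata.normalize('NFC', stripped)  # 'ō' (U+014D)
print(composed)
```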
(Note that NFKD is quite aggressive about decomposition, to the point that some semantics can be lost: anything with a "compatibility" decomposition gets separated out. E.g. '½' (U+00BD) ↦ '1⁄2' (with FRACTION SLASH, U+2044); '²' (U+00B2) ↦ '2'; '①' (U+2460) ↦ '1'; etc.)
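A quick sketch demonstrating that lossiness:

```python
import unicodedata

for ch in u'½', u'²', u'①':
    print(repr(ch), '->', repr(unicodedata.normalize('NFKD', ch)))
# The fraction becomes '1' + FRACTION SLASH + '2'; the superscript
# and circled digits become plain ASCII digits.
```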