
Python 3 len() function for Unicode characters

I believed Python 3 got everything right on Unicode, so I was surprised when I ran into this situation.

>>> amma = "அம்மா"
>>> amma
'அம்மா'
>>> len(amma)
5

The Tamil string "அம்மா" clearly has 3 letters, so a return value of 5 for len("அம்மா") is hard to accept.

How do users of other Dravidian or Brahmic scripts solve this issue to get the right string length?

Edit #1: Considering @joey's comment, this question can be rephrased as follows.

How to calculate the grapheme length in Python?

We know that Swift and Perl 6 do this by default:

  2> let amma = "அம்மா".characters.count
amma: Distance = 3

It may have 3 letters, but it has 5 characters (Unicode code points), and len() counts code points:

$ charinfo 'அம்மா'
U+0B85 TAMIL LETTER A [Lo]
U+0BAE TAMIL LETTER MA [Lo]
U+0BCD TAMIL SIGN VIRAMA [Mn]
U+0BAE TAMIL LETTER MA [Lo]
U+0BBE TAMIL VOWEL SIGN AA [Mc]
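
The same breakdown can be reproduced in Python itself with the standard unicodedata module. This is only a small sketch; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME [Category]' line per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"
        for ch in text
    ]

amma = "அம்மா"
for line in describe(amma):
    print(line)
```

Each printed line corresponds to one of the units that len() counts.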

If you need to be more specific, count only the characters that are in the Letter category.
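
One way to do that without hard-coding any script range is to check each code point's general category with unicodedata. This is a minimal sketch; count_letters is a hypothetical helper name, and it simply skips marks, digits, punctuation and everything else outside the L* categories.

```python
import unicodedata

def count_letters(text):
    """Count code points whose general category is a Letter (Lu, Ll, Lt, Lm, Lo)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))

amma = "அம்மா"
print(count_letters(amma))  # 3: the virama (Mn) and the vowel sign (Mc) are skipped
```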

The code below counts only the base letters in the Tamil block and ignores Unicode marks (using the standard re module).

import re
amma = "அம்மா"
len(re.findall("[ஃ-ஹ]", amma))

The code below is a fast, script-independent way to count letters in Unicode: it matches each letter together with any combining marks that follow it (using the third-party regex module).

import regex
amma = "அம்மா"
len(regex.findall(r'\p{L}\p{M}*', amma))
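
If installing a third-party module is not an option, a similar grouping can be approximated with the standard library alone. The sketch below attaches every combining mark to the character before it. Note the assumptions: letter_clusters is a made-up name, it groups any non-mark character (not only letters), and it is not a full implementation of Unicode grapheme cluster segmentation, so results may differ for conjuncts and other complex sequences.

```python
import unicodedata

def letter_clusters(text):
    """Group each non-mark code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # Combining mark (Mn, Mc, Me): attach it to the previous cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters

amma = "அம்மா"
print(len(letter_clusters(amma)))  # 3
```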

Package

pip install Open-Tamil

Code

from tamil import utf8
amma = "அம்மா"
letters = utf8.get_letters(amma)
print(len(letters))


 