Python 3 len() function for Unicode characters

Question

I believed Python 3 got everything right with Unicode, so I was surprised when I ran into this situation.

>>> amma = "அம்மா"
>>> amma
'அம்மா'
>>> len(amma)
5

The Tamil string "அம்மா" has 3 letters, so a return value of 5 from len("அம்மா") is hard to accept.

How do users of other Dravidian or Brahmic scripts solve this issue and get the right string length?

Edit #1: Following @joey's comment, this question can be rephrased as follows.

How can I calculate the length of a string in graphemes in Python?

Swift and Perl6 do this by default:

  2> let amma = "அம்மா".characters.count
amma: Distance = 3

Answer 1

It may have 3 letters, but it has 5 characters (Unicode code points), and code points are what len() counts:

$ charinfo 'அம்மா'
U+0B85 TAMIL LETTER A [Lo]
U+0BAE TAMIL LETTER MA [Lo]
U+0BCD TAMIL SIGN VIRAMA [Mn]
U+0BAE TAMIL LETTER MA [Lo]
U+0BBE TAMIL VOWEL SIGN AA [Mc]

If you need the number of letters, count only the characters whose Unicode general category is Letter.
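A minimal sketch of that idea, using only the standard unicodedata module. The helper name count_letters is made up for illustration, and the string is written with escapes so the code points match the charinfo listing above exactly.

```python
import unicodedata

def count_letters(text):
    """Count code points whose general category starts with 'L' (Lu, Ll, Lo, ...)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))

# U+0B85 U+0BAE U+0BCD U+0BAE U+0BBE, i.e. the word from the question
amma = "\u0b85\u0bae\u0bcd\u0bae\u0bbe"

print(len(amma))            # 5 code points
print(count_letters(amma))  # 3 letters: the virama (Mn) and vowel sign (Mc) are skipped
```

Note that this counts letters, not graphemes: it ignores digits, punctuation, and spaces entirely, so it only answers the "how many letters" form of the question.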

Answer 2

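The code below counts only the letters and ignores Unicode combining marks (using the standard re module).

import re
amma = "அம்மா"
# The class spans the Tamil letters from ஃ (U+0B83) to ஹ (U+0BB9);
# the virama and the vowel signs fall outside that range.
print(len(re.findall("[ஃ-ஹ]", amma)))  # 3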

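A more general way to count letters in Unicode text uses the third-party regex module, matching each letter (\p{L}) together with any combining marks (\p{M}) that follow it.

import regex
amma = "அம்மா"
# Raw string, so that \p reaches the regex engine unchanged
print(len(regex.findall(r'\p{L}\p{M}*', amma)))  # 3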
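If installing a third-party module is not an option, a similar grouping can be approximated with the standard library. The sketch below attaches every combining mark (general category M*) to the character before it. The function name split_clusters is hypothetical, and this is a simplification, not the full Unicode grapheme cluster algorithm; unlike the pattern above, it also lets non-letters start a cluster.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # Mn, Mc, Me: attach the mark to the preceding cluster
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster
            clusters.append(ch)
    return clusters

# U+0B85 U+0BAE U+0BCD U+0BAE U+0BBE, i.e. the word from the question
amma = "\u0b85\u0bae\u0bcd\u0bae\u0bbe"
clusters = split_clusters(amma)

print(clusters)
print(len(clusters))  # 3
```

Returning the clusters rather than just a count also makes it possible to slice or reverse the string without separating a mark from its base letter.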

Answer 3

Package

pip install Open-Tamil

Code

from tamil import utf8
amma = "அம்மா"
letters = utf8.get_letters(amma)
print(len(letters))

3 answers

solution1
2 2016-01-27 10:23:39

solution2
1 2020-07-24 12:51:13

solution3
0 2020-07-24 07:50:52
