How many displayable characters in a unicode string (Japanese / Chinese)

I need to know how many displayable characters are in a Unicode string containing Japanese / Chinese characters.

Sample code to make the question clear:

# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)

12

print str

睡眠時間 <<< note that four characters are displayed

How can I know, from the string, that 4 characters are going to be displayed?

This string

str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'

is an encoded representation of Unicode code points. It contains bytes, so len(str) returns the number of bytes.

You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode them. The most popular encoding is UTF-8, in which one code point takes from 1 to 4 bytes. You don't need to remember that, though; just decode the string:

>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'

Here you can see 4 Unicode code points. Print it to see the printable version:

>>> print str.decode('utf8')
睡眠時間

And get the number of Unicode code points:

>>> len(str.decode('utf8'))
4
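
For readers on Python 3, where str is already Unicode and the encoded form is a bytes object, the same check might look like the sketch below. The names raw and text are illustrative, and the bytes are assumed to be UTF-8 as in the question.

```python
# Python 3 sketch: the question's bytes, assumed to be UTF-8 encoded.
raw = b'\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'

# Decoding turns the bytes into a str of Unicode code points.
text = raw.decode('utf-8')

print(len(raw))    # 12 bytes
print(len(text))   # 4 code points

# Each of these code points takes 3 bytes in UTF-8.
bytes_per_char = [len(ch.encode('utf-8')) for ch in text]
print(bytes_per_char)  # [3, 3, 3, 3]
```

The byte count per code point depends on the character: ASCII takes 1 byte in UTF-8, while these ideographs take 3 each, which is where 12 comes from.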

UPDATE: See also abarnert's answer, which covers the remaining cases.

If you actually want "displayable characters", you have to do two things.

First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:

s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')

Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function gives you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.

For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode didn't really have anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:

import unicodedata
def displayable(c):
    # Categories starting with 'C' (Cc, Cf, Cs, Co, Cn) are not displayable.
    return not unicodedata.category(c).startswith('C')
p = u''.join(c for c in u if displayable(c))

Or, if you've decided that Mn and Me are also not "displayable" but Mc is:

def displayable(c):
    # Also treat nonspacing (Mn) and enclosing (Me) marks as not displayable.
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
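
To see what such a category filter does in practice, here is a small Python 3 sketch. The HIDDEN set and the helper count_displayable are hypothetical names, and the choice of categories is just the assumption described above, not an official definition.

```python
import unicodedata

# Assumption: these general categories are the ones we call "not displayable".
HIDDEN = {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}

def count_displayable(text):
    """Count the code points whose general category is not in HIDDEN."""
    return sum(1 for ch in text if unicodedata.category(ch) not in HIDDEN)

sample = '睡眠時間\n'   # four ideographs (Lo) followed by a newline (Cc)
print([unicodedata.category(ch) for ch in sample])
print(len(sample))                # 5 code points
print(count_displayable(sample))  # 4, the newline is filtered out
```

Changing what counts as displayable only means editing the HIDDEN set; the counting logic stays the same.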

But even this may not be what you want. For example, does a letter followed by a nonspacing combining mark count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
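
As a rough illustration of the Ç example, the Python 3 sketch below uses unicodedata.normalize with the canonical forms NFC and NFD, which are enough for this particular pair. The variable names are illustrative only.

```python
import unicodedata

decomposed = 'C\u0327'   # U+0043 followed by U+0327 (combining cedilla)
precomposed = '\u00c7'   # U+00C7, the same character as one code point

# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', precomposed)

print(len(decomposed), len(nfc))   # 2 1
print(len(precomposed), len(nfd))  # 1 2
print(nfc == precomposed)          # True
```

Which form to normalize to depends on the answer you want: composed forms give one code point per character here, while decomposed forms count the base letter and the mark separately.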


If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:

p = ''.join(c for c in u if c.isprintable())

But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not: for example, they consider all separators except the ASCII space ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.
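
A quick way to check whether str.isprintable matches your own idea of "displayable" is to try it on a few characters you care about. The Python 3 sketch below does that; the sample characters are just an illustrative selection.

```python
# Python 3 sketch: probe str.isprintable() on a few sample characters.
samples = {
    'ideograph': '\u7761',          # 睡
    'ASCII space': ' ',
    'no-break space': '\u00a0',     # a separator (Zs) other than ' '
    'ideographic space': '\u3000',  # the full-width space used in CJK text
    'newline': '\n',                # a control character (Cc)
}

results = {name: ch.isprintable() for name, ch in samples.items()}
for name, printable in results.items():
    print(name, printable)
```

Only the ideograph and the ASCII space come back as printable, so text containing full-width spaces would lose them under this filter.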
