How many displayable characters in a unicode string (Japanese / Chinese)
I need to know how many displayable characters are in a unicode string containing Japanese / Chinese characters.
Sample code to make the question clear:
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)
12
print str
睡眠時間 <<< note that four characters are displayed
How can I know, from the string, that 4 characters are going to be displayed?
This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
is an encoded representation of Unicode code points. It contains bytes, so len(str) returns the number of bytes.
You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode those code points. The most popular encoding is UTF-8. In UTF-8, one Unicode code point takes from 1 to 4 bytes. But you don't need to remember that; just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 Unicode code points. Print it to see the printable version:
>>> print str.decode('utf8')
睡眠時間
And to get the number of code points:
>>> len(str.decode('utf8'))
4
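The snippets above are Python 2. As a rough sketch of the same idea in Python 3 (an assumption on my part, since the question uses Python 2), bytes and text are separate types, so the distinction between "number of bytes" and "number of code points" is explicit. The variable names here are only illustrative.

```python
# Python 3 sketch: the UTF-8 bytes from the question, as a bytes object.
raw = b'\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'

# Decoding turns bytes into a str, which is a sequence of code points.
text = raw.decode('utf-8')

byte_count = len(raw)          # counts bytes
code_point_count = len(text)   # counts code points

# How many bytes each code point occupies when encoded as UTF-8.
bytes_per_char = [len(ch.encode('utf-8')) for ch in text]

print(byte_count, code_point_count, bytes_per_char)
# 12 4 [3, 3, 3, 3]
```

Each of these four ideographs happens to take 3 bytes in UTF-8, which is why 12 bytes decode to 4 code points.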
UPDATE: See also abarnert's answer, which covers the remaining cases.
If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function can give you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode doesn't really have anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
import unicodedata
def displayable(c):
    return not unicodedata.category(c).startswith('C')  # 'C*' = control, format, private-use, unassigned, ...
p = u''.join(c for c in u if displayable(c))
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
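To make the category filter concrete, here is a minimal Python 3 sketch that counts rather than joins. The set NOT_DISPLAYABLE and the helper count_displayable are hypothetical names, and the choice of categories is only one possible reading of "displayable", as discussed above.

```python
import unicodedata

# Assumption: these general categories are treated as "not displayable".
# Cc = control, Cf = format, Co = private use, Cn = unassigned,
# Mn = nonspacing mark, Me = enclosing mark.
NOT_DISPLAYABLE = {'Cc', 'Cf', 'Co', 'Cn', 'Mn', 'Me'}

def count_displayable(text):
    """Count code points whose general category is not in NOT_DISPLAYABLE."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in NOT_DISPLAYABLE)

sample = '\u7761\u7720\u6642\u9593'        # 睡眠時間
with_extras = sample + '\u200b' + '\n'     # adds a zero-width space (Cf) and a newline (Cc)

print(count_displayable(sample))        # 4
print(count_displayable(with_extras))   # 4
print(count_displayable('C\u0327'))     # 1 (the combining cedilla is Mn)
```

Swapping in a different set of categories changes the answer, which is the point: the count depends on the definition you pick.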
But even this may not be what you want. For example, does a letter followed by a nonspacing combining mark count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
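As an illustration of the normalization point, the following Python 3 sketch shows how the same visible character has a different code point count depending on the normalization form. Only unicodedata.normalize from the standard library is used; the variable names are illustrative.

```python
import unicodedata

decomposed = 'C\u0327'    # U+0043 followed by U+0327 (combining cedilla)
precomposed = '\u00c7'    # U+00C7, the single-code-point form of the same character

# NFKC composes where it can; NFKD decomposes.
composed_form = unicodedata.normalize('NFKC', decomposed)
decomposed_form = unicodedata.normalize('NFKD', precomposed)

print(len(decomposed), len(composed_form))       # 2 1
print(len(precomposed), len(decomposed_form))    # 1 2
print(composed_form == precomposed)              # True
```

So if you want Ç to count as one character, normalizing to a composed form before counting is one way to get there, at least for characters that have a precomposed code point.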
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, there is one and Python does know it; you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u if c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not; for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for every distinction anyone might want to make.
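A small Python 3 sketch of that caveat: str.isprintable keeps the ASCII space but drops other separators, such as the ideographic space U+3000 that is common in Japanese and Chinese text. The sample string is made up for illustration.

```python
text = '\u7761\u7720\u6642\u9593'            # 睡眠時間

# Append a newline (control), an ideographic space (separator), and an ASCII space.
noisy = text + '\n' + '\u3000' + ' '

# Keep only what Python considers printable.
printable = ''.join(c for c in noisy if c.isprintable())

print(len(noisy), len(printable))    # 7 5
print('\u3000'.isprintable())        # False
print(' '.isprintable())             # True
```

If the ideographic space should count as displayable for your purposes, isprintable alone is not the right filter and you would fall back to an explicit category check.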