How many displayable characters in a unicode string (Japanese / Chinese)

I need to know how many displayable characters are in a Unicode string containing Japanese / Chinese characters.

Sample code to make the question obvious:

# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)

12

print str

睡眠時間 <<< note that four characters are displayed

How can I know, from the string, that 4 characters are going to be displayed?

This string

str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'

is an encoded representation of Unicode code points. It contains bytes, so len(str) returns the number of bytes.

You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode those code points. The most popular encoding is UTF-8. In UTF-8, one code point takes from 1 to 4 bytes (the original design allowed up to 6, but the current standard limits it to 4). But you don't need to remember that; just decode the string:

>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'

Here you can see 4 Unicode code points. Print it to see the printable version:

>>> print str.decode('utf8')
睡眠時間

And to get the number of Unicode code points:

>>> len(str.decode('utf8'))
4
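The session above is Python 2. The same idea carries over to Python 3, where the encoded data is a bytes object and the decoded text is a str. A minimal sketch, assuming the bytes really are UTF-8 (the names `raw`, `text`, and `bytes_per_char` are only illustrative):

```python
# Python 3 sketch: encoded data is bytes, decoded text is str.
raw = b'\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'

# Assumption: the bytes are UTF-8; decode() raises UnicodeDecodeError otherwise.
text = raw.decode('utf-8')

byte_count = len(raw)          # number of bytes in the encoded form
code_point_count = len(text)   # number of Unicode code points

# How many UTF-8 bytes each code point occupies.
bytes_per_char = [len(ch.encode('utf-8')) for ch in text]

print(byte_count, code_point_count, bytes_per_char)
```

Each of these four CJK characters takes three bytes in UTF-8, which is where the original count of 12 comes from.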

UPDATE: See also abarnert's answer, which covers all possible cases.

If you actually want "displayable characters", you have to do two things.

First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:

s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')

Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function can give you the general category of any code unit. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
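To get a feel for what `unicodedata.category` returns, here is a small Python 3 sketch that looks up a handful of sample characters. The choice of samples and the `categories` dict are just for illustration:

```python
import unicodedata

# A few sample characters: Latin letter, CJK ideograph,
# combining cedilla, newline, and ASCII space.
samples = ['a', '\u7761', '\u0327', '\n', ' ']

# Map each character to its two-letter general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print('U+%04X' % ord(ch), cat)
```

The first letter of each code is the major class (L for letters, M for marks, C for "other", Z for separators), which is what makes prefix checks on the category string practical.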

For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode doesn't really have anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:

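import unicodedata

def displayable(c):
    # Every 'C*' category (control, format, private-use, unassigned, ...) is treated as not displayable.
    return not unicodedata.category(c).startswith('C')
p = u''.join(c for c in u if displayable(c))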

Or, if you've decided that Mn and Me are also not "displayable" but Mc is:

def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
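For a count rather than a filtered string, a Python 3 sketch along the same lines might look like this. Which categories count as not displayable is a judgment call, so the set is a parameter; `count_displayable` and `NOT_DISPLAYABLE` are hypothetical names, not anything from the standard library:

```python
import unicodedata

# Assumption: these general categories are treated as not displayable.
# Adjust the set to match your own definition.
NOT_DISPLAYABLE = frozenset({'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'})

def count_displayable(text, excluded=NOT_DISPLAYABLE):
    """Count code points whose general category is not in `excluded`."""
    return sum(1 for ch in text if unicodedata.category(ch) not in excluded)

sample = '\u7761\u7720\u6642\u9593'   # the four characters from the question
with_noise = sample + '\u200b\n'      # plus a zero-width space (Cf) and a newline (Cc)

print(count_displayable(sample))      # 4
print(count_displayable(with_noise))  # 4
```

The zero-width space and the newline both add to `len()`, but neither is counted here because their categories are in the excluded set.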

But even this may not be what you want. For example, does a letter followed by a nonspacing combining mark count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
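The Ç example can be checked directly with `unicodedata.normalize`. This is a Python 3 sketch; the variable names are illustrative, and the composing or decomposing form you want depends on the answer you are after:

```python
import unicodedata

decomposed = 'C\u0327'    # U+0043 followed by U+0327 (combining cedilla)
precomposed = '\u00c7'    # U+00C7, the same character as a single code point

composed = unicodedata.normalize('NFKC', decomposed)   # compose where possible
expanded = unicodedata.normalize('NFKD', precomposed)  # decompose fully

print(len(decomposed), len(composed))    # 2 1
print(len(precomposed), len(expanded))   # 1 2
print(composed == precomposed)           # True
```

Normalizing to a composed form before counting makes both spellings of Ç count as one, but only for sequences that have a precomposed code point; arbitrary letter-plus-mark combinations still come out as more than one code point.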


If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do; you just need to use a newer version of Python. In 3.x, you can just write:

p = ''.join(c for c in u if c.isprintable())

But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not; for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for every distinction anyone might want to make.
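The separator caveat matters for Japanese and Chinese text in particular, because the full-width ideographic space (U+3000) is a separator and so is dropped by `isprintable()`. A small Python 3 sketch, with a made-up sample string:

```python
# Sample text with U+3000 IDEOGRAPHIC SPACE between the two words.
text = '\u7761\u7720\u3000\u6642\u9593'

# Keep only the characters Python considers printable.
printable = ''.join(c for c in text if c.isprintable())

print(len(text), len(printable))                   # 5 4
print(' '.isprintable(), '\u3000'.isprintable())   # True False
```

Whether a full-width space should count as "displayable" is exactly the kind of decision you still have to make yourself.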
