Printing unicode character NAMES - e.g. 'GREEK SMALL LETTER ALPHA' - instead of 'α'
I am testing the function isprintable(). I want to print the Unicode names of all characters in the string string.whitespace + unicodedata.lookup("GREEK SMALL LETTER ALPHA").
How can I print all the names, e.g. 'SPACE', 'NO-BREAK SPACE', 'HORIZONTAL TAB', 'GREEK SMALL LETTER ALPHA'?
import string
import unicodedata

for e in string.whitespace + unicodedata.lookup("GREEK SMALL LETTER ALPHA"):
    print(ord(e))
    print(unicodedata.name(e))
I get the error 'ValueError: no such name':
32
SPACE
9
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
ValueError: no such name
As the comments indicate, the Unicode database doesn't have names for every character, but NameAliases.txt provides aliases for some of them, including the control characters. The code below parses that file and returns an alias if one exists; in this case, the first one found in the file:
import string
import requests
import unicodedata as ud

# Pull the official NameAliases.txt from the matching Unicode database
# the current Python was built with.
response = requests.get(f'http://www.unicode.org/Public/{ud.unidata_version}/ucd/NameAliases.txt')
response.raise_for_status()  # fail early if the download did not succeed

# Parse NameAliases.txt, storing the first instance of a code and a name.
# Each data line has the form "code;alias;type".
aliases = {}
for line in response.text.splitlines():
    if not line.strip() or line.startswith('#'):
        continue
    code, alias, _ = line.split(';')
    val = chr(int(code, 16))
    if val not in aliases:
        aliases[val] = alias

# Return the first alias from NameAliases.txt if it exists when unicodedata.name() fails.
def name(c):
    try:
        return ud.name(c)
    except ValueError:
        return aliases.get(c, '<no name>')

for e in string.whitespace + ud.lookup("GREEK SMALL LETTER ALPHA"):
    print(f'U+{ord(e):04X} {name(e)}')
Output:
U+0020 SPACE
U+0009 CHARACTER TABULATION
U+000A LINE FEED
U+000D CARRIAGE RETURN
U+000B LINE TABULATION
U+000C FORM FEED
U+03B1 GREEK SMALL LETTER ALPHA
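Keeping only the first alias discards the third field of each line, which gives the alias type. If you would rather choose between, say, a long control name and a short abbreviation, a minimal sketch along the following lines may help. It runs on a small inline sample written in the NameAliases.txt format instead of the downloaded file; the sample lines and the names SAMPLE, parse_aliases and aliases_by_type are illustrative assumptions, not part of the answer above.

```python
from collections import defaultdict

# Assumed sample in the "code;alias;type" format of NameAliases.txt.
# The real file is much longer; check it for the authoritative entries.
SAMPLE = """\
# comment lines start with '#'
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
000A;LINE FEED;control
000A;LF;abbreviation
"""

def parse_aliases(text):
    """Map each character to {alias_type: [aliases in file order]}."""
    table = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        code, alias, kind = line.split(';')
        table[chr(int(code, 16))][kind].append(alias)
    return table

aliases_by_type = parse_aliases(SAMPLE)
print(aliases_by_type['\t']['control'])
print(aliases_by_type['\t']['abbreviation'])
```

With the full file, the same function could be fed response.text, and the caller could then pick whichever alias type suits the output.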
As mentioned in the Q&A linked by wjandrea in the comments, ASCII control characters do not have official names in the current Unicode standard, so you get a ValueError when you try to look them up.
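To see which characters are affected without triggering the exception, a small sketch like this one can help. It relies on two documented features of the unicodedata module: the optional default argument of unicodedata.name() and unicodedata.category(), which reports 'Cc' for control characters. The variable name results is only illustrative.

```python
import unicodedata

results = []
for ch in "\t\n α":
    # The second argument is returned instead of raising ValueError.
    name = unicodedata.name(ch, None)
    # General category 'Cc' marks control characters.
    category = unicodedata.category(ch)
    results.append((ord(ch), category, name))
    print(f"U+{ord(ch):04X} category={category} name={name}")
```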
The curses.ascii module in the standard library provides a list of two- or three-character mnemonics for these characters, corresponding to the names listed in the Char column of the ASCII table (as displayed by man ascii), but without the descriptions.
So we can do this:
import string
import unicodedata
from curses.ascii import controlnames

for e in (string.whitespace + "\N{GREEK SMALL LETTER ALPHA}"):
    try:
        name = unicodedata.name(e)
    except ValueError:
        # controlnames is indexed by code point and covers 0x00 through 0x20
        name = controlnames[ord(e)]
    print(name)
giving this result:
SPACE
HT
LF
CR
VT
FF
GREEK SMALL LETTER ALPHA
which is not ideal, but it may be the best that can be done without using external resources, as done in this excellent answer.
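One caveat with indexing controlnames directly: the list only covers code points 0x00 through 0x20, so an unnamed character outside that range (DEL, U+007F, for example) would raise an IndexError, and the curses module is not available on every platform. A hedged sketch of a more defensive variant follows. The helper safe_name and the placeholder formats it returns are assumptions for illustration, loosely modelled on Unicode's code point labels, not an official naming scheme.

```python
import unicodedata

try:
    # curses may be missing on some platforms (for example, stock Windows builds).
    from curses.ascii import controlnames
except ImportError:
    controlnames = []

def safe_name(ch):
    """Return a Unicode name, a curses mnemonic, or a placeholder label."""
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    cp = ord(ch)
    if cp < len(controlnames):
        return controlnames[cp]
    if unicodedata.category(ch) == 'Cc':
        # Hypothetical label for control characters that have no mnemonic here.
        return f'<control-{cp:04X}>'
    return f'<U+{cp:04X}>'

for ch in " \t\x7f\N{GREEK SMALL LETTER ALPHA}":
    print(safe_name(ch))
```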