
An equivalent to string.ascii_letters for unicode strings in python 2.x?

In the "string" module of the standard library,

string.ascii_letters ## Same as string.ascii_lowercase + string.ascii_uppercase

is

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

Is there a similar constant which would include everything that is considered a letter in Unicode?

You can construct your own constant of Unicode upper and lower case letters with:

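import unicodedata as ud
# Every code point up to U+FFFF (Python 2: unichr and xrange).
all_unicode = u''.join(unichr(i) for i in xrange(65536))
# Keep only uppercase (Lu) and lowercase (Ll) letters.
unicode_letters = u''.join(c for c in all_unicode
                           if ud.category(c) in ('Lu', 'Ll'))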

This makes a string 2153 characters long (narrow Unicode Python build). For code like letter in unicode_letters it would be faster to use a set instead:

unicode_letters = set(unicode_letters)
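
If the constant should cover every kind of letter rather than only upper and lower case, one possible variation is to keep any character whose category starts with 'L', which also picks up the Lt, Lm and Lo categories. The sketch below assumes Python 3, where chr replaces unichr and range replaces xrange; ALL_LETTERS is a name made up here, not something the standard library provides.

```python
import sys
import unicodedata as ud

# Assumption: Python 3, so chr() accepts the full code point range.
# ALL_LETTERS is a hypothetical name, not a standard-library constant.
ALL_LETTERS = frozenset(
    chr(i)
    for i in range(sys.maxunicode + 1)
    if ud.category(chr(i)).startswith('L')  # Lu, Ll, Lt, Lm, Lo
)

# Membership tests against the frozenset are fast.
print('a' in ALL_LETTERS)
print('5' in ALL_LETTERS)
```

The exact size of the set depends on the Unicode version bundled with the interpreter, so it is better not to rely on a fixed length.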

There's no string, but you can check whether a character is a letter using the unicodedata module, in particular its category() function.

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'A')
'Lu'
>>> unicodedata.category(u'5')
'Nd'
>>> unicodedata.category(u'ф') # Cyrillic f.
'Ll'
>>> unicodedata.category(u'٢') # Arabic-indic numeral for 2.
'Nd'

Ll means "letter, lowercase". Lu means "letter, uppercase". Nd means "number, decimal digit".
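
Since every letter category starts with "L", the check can be wrapped in a small predicate so that no constant is needed at all. This is only a sketch in Python 3 syntax; is_letter and only_letters are hypothetical helper names, not part of unicodedata.

```python
import unicodedata

def is_letter(ch):
    """Return True if the single character ch is in a Unicode letter category."""
    return unicodedata.category(ch).startswith('L')

def only_letters(text):
    """Drop every character of text that is not a letter."""
    return ''.join(ch for ch in text if is_letter(ch))

print(is_letter(u'ф'))                    # Cyrillic letter
print(is_letter(u'٢'))                    # Arabic-Indic digit
print(only_letters(u'abc 123 фыва ٢'))
```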

That would be a pretty massive constant. Unicode currently covers over 100,000 different characters. So the answer is no.

The question is why you would need it. There might be some other way of solving whatever your problem is, with the unicodedata module, for example.

Update: You can download files with all Unicode code point names and other information from ftp://ftp.unicode.org/ , and do loads of interesting stuff with that.
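
As a rough illustration of what can be done with those files: the Unicode Character Database includes a file named UnicodeData.txt in which each line is a semicolon-separated record whose first three fields are the code point in hex, the character name, and the general category. The file name and field layout are stated here as an assumption from the UCD documentation, not from the answer above. The sketch parses a few such records given inline (Python 3 syntax); a downloaded file could be fed to the same function line by line. parse_unicode_data is a hypothetical helper name.

```python
# A few sample records in the UnicodeData.txt layout (assumed format).
SAMPLE = """\
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def parse_unicode_data(lines):
    """Yield (character, name, category) for each non-empty record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(';')
        # Field 0: code point in hex, field 1: name, field 2: general category.
        yield chr(int(fields[0], 16)), fields[1], fields[2]

# Keep only the characters whose category marks them as letters.
letters = [ch for ch, name, cat in parse_unicode_data(SAMPLE.splitlines())
           if cat.startswith('L')]
print(letters)
```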

As mentioned in previous answers, the string would indeed be way too long. So, you have to target one or more specific languages.
[EDIT: I realized it was the case for my original intended use, and for most uses, I guess. However, in the meantime, Mark Tolonen gave a good answer to the question as it was asked, so I chose his answer, although I used the following solution]

This is easily done with the "locale" module:

import locale
import string
code = 'fr_FR' ## Do NOT specify encoding (see below)
locale.setlocale(locale.LC_CTYPE, code)
encoding = locale.getlocale()[1]
letters = string.letters.decode(encoding)

with "letters" being a 117-character-long unicode string.

Apparently, string.letters is dependent on the default encoding for the selected language code, rather than on the language itself. Setting the locale to fr_FR or de_DE or es_ES will update string.letters to the same value (since they are all encoded in ISO8859-1 by default).
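
The dependence on the encoding rather than on the language can be seen without touching the locale at all. The sketch below is a swapped-in technique, not the locale-based method above: it decodes every byte value with ISO8859-1 and keeps the characters that unicodedata classifies as letters. It assumes Python 3 syntax, and latin1_letters is a name made up for the example.

```python
import unicodedata

# Decode each of the 256 possible byte values with ISO8859-1.
all_latin1 = bytes(range(256)).decode('iso8859-1')

# Keep the characters in any Unicode letter category (Lu, Ll, Lo, ...).
latin1_letters = ''.join(
    c for c in all_latin1 if unicodedata.category(c).startswith('L')
)

print(len(latin1_letters))
print(latin1_letters)
```

This yields the 52 ASCII letters plus 65 accented and other letters from the upper half of ISO8859-1, for 117 characters, the same length as reported for the locale-based string.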

If you add an encoding to the language code (de_DE.UTF-8), the default encoding will still be used for string.letters. That would cause a UnicodeDecodeError if you used the rest of the above code.
