
Python: How to check if a unicode string contains a cased character?

I'm writing a filter that checks whether a unicode string (UTF-8 encoded) contains no uppercase characters (in any language). It's fine with me if the string doesn't contain any cased characters at all.

For example: 'Hello!' will not pass the filter, but "!" should pass the filter, since "!" is not a cased character.

I planned to use the islower() method, but in the example above, "!".islower() will return False.

According to the Python Docs, "The python unicode method islower() returns True if the unicode string's cased characters are all lowercase and the string contained at least one cased character, otherwise, it returns False."

Since the method also returns False when the string doesn't contain any cased character (e.g. "!"), I want to check whether the string contains any cased character at all.

Something like this:

string = unicode("!@#$%^", 'utf-8')

def passes_filter(string):  # function name is illustrative
    # check first if it contains cased characters
    if not contains_cased(string):
        return True
    return string.islower()

Any suggestions for a contains_cased() function?

Or perhaps a different implementation approach?

Thanks!

import unicodedata as ud

def contains_cased(u):
  return any(ud.category(c)[0] == 'L' for c in u)

Here is the full scoop on Unicode character categories.

Letter categories include:

Ll -- lowercase
Lu -- uppercase
Lt -- titlecase
Lm -- modifier
Lo -- other

Note that Ll <-> islower(); similarly for Lu; (Lu or Lt) <-> istitle()
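The category list above suggests a narrower test than matching every letter category. The following is a minimal sketch in Python 3 (the thread itself uses Python 2), assuming that "cased" means only the Ll, Lu and Lt categories. That choice is an assumption made for illustration, not Unicode's full definition of a cased character: a few characters with case mappings, such as circled letters and Roman numerals, fall outside the letter categories.

```python
import unicodedata

# Categories treated as "cased" in this sketch (an assumption, see above).
CASED_CATEGORIES = {"Ll", "Lu", "Lt"}

def contains_cased(text):
    """Return True if any character is a lowercase, uppercase or titlecase letter."""
    return any(unicodedata.category(ch) in CASED_CATEGORIES for ch in text)

print(contains_cased("!@#$%^"))  # False
print(contains_cased("Hello!"))  # True
print(contains_cased("漢字"))     # False: CJK ideographs are category Lo
```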

You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.

Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters; it is very hard to see how they might be considered "cased".

You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:

>>> cased = lambda c: c.upper() != c or c.lower() != c
>>> sum(cased(unichr(i)) for i in xrange(65536))
1970
>>>

Interestingly, there are 1216 x Ll and 937 x Lu, a total of 2153, which leaves scope for further investigation of what Ll and Lu really mean.
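For readers on Python 3, where unichr and xrange no longer exist, the same experiment might look like the sketch below. The name is_cased is illustrative. The total will differ from the Python 2.6 figure quoted above, because it depends on the Unicode version bundled with the interpreter and because Python 3 applies full case mappings (for example, "ß".upper() is "SS"), so no expected count is shown.

```python
def is_cased(ch):
    """True if upper-casing or lower-casing the character changes it."""
    return ch.upper() != ch or ch.lower() != ch

# Count over the Basic Multilingual Plane (code points 0x0000 to 0xFFFF).
bmp_count = sum(is_cased(chr(i)) for i in range(0x10000))
print(bmp_count)
```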

Use the unicodedata module:

unicodedata.category(character)

returns "Ll" for lowercase letters and "Lu" for uppercase ones.

Here you can find a list of Unicode character categories.
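Since the original goal was a filter that rejects strings containing uppercase characters, one possible way to apply unicodedata.category() directly is sketched below in Python 3. The function name has_no_uppercase is hypothetical, and rejecting titlecase (Lt) letters along with uppercase (Lu) ones is an assumption about what the filter should do.

```python
import unicodedata

def has_no_uppercase(text):
    """Return True if no character is an uppercase (Lu) or titlecase (Lt) letter."""
    return not any(unicodedata.category(ch) in ("Lu", "Lt") for ch in text)

print(has_no_uppercase("Hello!"))  # False
print(has_no_uppercase("!"))       # True
print(has_no_uppercase("привет"))  # True
```

Unlike islower(), this passes strings that contain no cased characters at all, which is the behaviour the question asks for.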
