Is there a way to know whether a Unicode string contains any Chinese/Japanese character in Python?

I have a Unicode string in Python. I am looking for a way to determine whether the string contains any Chinese/Japanese characters. Ideally, I would also like to locate those characters.

It seems this is a bit different from a language detection problem. My string can be a mixture of English and Chinese texts.

My code has Internet access.

You can use the Unicode Script property to determine which script each character is commonly associated with.

Python's unicodedata module, sadly, does not have this property. However, a number of third-party modules, such as unicodedata2 and unicodescript, do have this information. You can query them and check whether any characters are in the Han script, which corresponds to Chinese characters (including Japanese kanji and Korean hanja).

I tried Python's unicodedata module mentioned by nneonneo in his answer and I think it probably works.

>>> import unicodedata
>>> unicodedata.name('你')
'CJK UNIFIED IDEOGRAPH-4F60'
>>> unicodedata.name('桜')
'CJK UNIFIED IDEOGRAPH-685C'
>>> unicodedata.name('あ')
'HIRAGANA LETTER A'
>>> unicodedata.name('ア')
'KATAKANA LETTER A'
>>> unicodedata.name('a')
'LATIN SMALL LETTER A'

As you can see, both Chinese characters and the Chinese characters adopted into Japanese (kanji) are named CJK UNIFIED IDEOGRAPH, and hiragana and katakana are correctly recognized. I didn't test Korean, but hanja should fall under CJK UNIFIED IDEOGRAPH too, whereas Hangul syllables have names beginning with HANGUL SYLLABLE.
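To also locate the characters, the name lookup above can be wrapped in a small loop. This is only a minimal sketch: the function name `find_cjk` and the tuple `CJK_NAME_PREFIXES` are made up for illustration, and the choice of prefixes is an assumption (for instance, it skips names starting with CJK COMPATIBILITY IDEOGRAPH).

```python
import unicodedata

# Name prefixes treated as Chinese/Japanese here. This tuple is an
# illustrative assumption, not an exhaustive list.
CJK_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA")


def find_cjk(text):
    """Return (index, character, name) for each character whose name matches."""
    hits = []
    for index, char in enumerate(text):
        # The second argument is returned for code points that have no name,
        # which avoids a ValueError on e.g. control characters.
        name = unicodedata.name(char, "")
        if name.startswith(CJK_NAME_PREFIXES):
            hits.append((index, char, name))
    return hits


print(find_cjk("I like 桜 and あ"))
```

An empty result means no matching character was found, so `bool(find_cjk(s))` answers the yes/no question as well.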

Also, if you only need a rough check for whether something is a CJK character/letter, looking at the general category seems simpler:

>>> import unicodedata
>>> unicodedata.category('你')
'Lo'
>>> unicodedata.category('桜')
'Lo'
>>> unicodedata.category('あ')
'Lo'
>>> unicodedata.category('ア')
'Lo'
>>> unicodedata.category('a')
'Ll'
>>> unicodedata.category('A')
'Lu'

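In the Unicode general categories, Ll is a lowercase letter, Lu is an uppercase letter, and Lo is an "other" letter. Note that Lo is not specific to CJK: letters of other caseless scripts, such as Hebrew or Arabic, are also Lo, so this check only separates CJK from cased scripts like Latin.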

You can use the regular expression [\⺀-\鿿] (the range U+2E80 to U+9FFF) to match CJK characters.
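A minimal sketch of how that range could be used with the standard re module to both detect and locate matches. The names `CJK_RANGE`, `contains_cjk` and `locate_cjk` are hypothetical. Keep in mind that U+2E80 to U+9FFF also covers hiragana, katakana and ideographic punctuation such as 。, but not Hangul syllables or the ideographs encoded beyond U+FFFF.

```python
import re

# U+2E80 (start of CJK Radicals Supplement) through U+9FFF
# (end of the CJK Unified Ideographs block), written with escapes.
CJK_RANGE = re.compile("[\u2e80-\u9fff]")


def contains_cjk(text):
    """Return True if any character of text falls inside the range."""
    return CJK_RANGE.search(text) is not None


def locate_cjk(text):
    """Return (index, character) for every character inside the range."""
    return [(match.start(), match.group()) for match in CJK_RANGE.finditer(text)]


print(contains_cjk("Hello 你好"))
print(locate_cjk("Hello 你好"))
```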
