
Converting unicode to Chinese

I am trying to scrape some Chinese text from a website using Python. When I retrieve it, the text is wrapped in HTML tags and looks like this:

我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.

(I had to format it as code to keep the HTML tags from disappearing.) However, once I use slicing to remove the HTML tags, I get:

我今天的心情ᄌヘ好。

Why is this weird character appearing in the second to last spot? Thank you for your help!

Using the third-party regex module (not the standard-library re), you can use the Unicode script property \p{Han} to keep only the Chinese characters. The examples use Python 2 syntax:

>>> text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
>>> import regex
>>> print u''.join(regex.findall(r'\p{Han}+', text, flags=regex.UNICODE))
我今天的心情不好
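
If installing the third-party regex module is not an option, a rough approximation is possible with the standard-library re module, which does not support \p{Han}. The sketch below assumes Python 3 and matches only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so it is narrower than \p{Han}, which also covers extension blocks. The names CJK_BASIC and han_only are hypothetical.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block (U+4E00-U+9FFF)
# is matched. \p{Han} in the regex module covers more than this range.
CJK_BASIC = re.compile(r'[\u4e00-\u9fff]+')


def han_only(text):
    """Return the concatenated runs of basic CJK ideographs found in text."""
    return ''.join(CJK_BASIC.findall(text))


text = '''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
print(han_only(text))
```

As with the \p{Han} version, the full stop 。 is dropped, because it is punctuation rather than an ideograph.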

Or, using unicodedata.name from the standard library:

>>> import unicodedata
>>> unicodedata.name(u'a')
'LATIN SMALL LETTER A'
>>> unicodedata.name(u'我')
'CJK UNIFIED IDEOGRAPH-6211'
>>> unicodedata.name(u'今')
'CJK UNIFIED IDEOGRAPH-4ECA'

>>> text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
>>> print u''.join(c for c in text if unicodedata.name(c).startswith('CJK'))
我今天的心情不好
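
Neither snippet explains where the stray character comes from. One plausible cause is that the slice cut through the three-byte encoding of 不, leaving two orphaned bytes behind. This assumes the page was held as a UTF-8 byte string (which is what a download returns in Python 2) and that a slice index was off by one. The sketch below reproduces that situation in Python 3. The variable names are hypothetical, and the exact indices used in the question are unknown.

```python
# Bytes, as a download would return them before decoding.
raw = '心情</em>不好。'.encode('utf-8')

# In UTF-8, '不' occupies three bytes.
assert '不'.encode('utf-8') == b'\xe4\xb8\x8d'

tag_start = raw.index(b'</em>')
tag_end = tag_start + len(b'</em>')

# Off-by-one slice on bytes: the first byte of '不' is removed with the tag,
# so the two remaining bytes no longer form a valid character.
broken = raw[:tag_start] + raw[tag_end + 1:]
broken_text = broken.decode('utf-8', errors='replace')
print(broken_text)

# Decoding first means every index refers to a whole character,
# so removing the tag cannot split one.
decoded = raw.decode('utf-8')
clean_text = decoded.replace('</em>', '')
print(clean_text)
```

How the orphaned bytes are displayed depends on the terminal and its encoding, which may be why the question shows an unrelated-looking character rather than a replacement mark. Decoding to a unicode string before slicing or filtering avoids the problem either way.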
