Converting unicode to Chinese

Question

I am trying to scrape some Chinese text from a website using Python. The text I get is surrounded by HTML tags and looks like this:

我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.

(I had to format it as code to keep the HTML tags from disappearing.) However, once I use slicing to remove the HTML tags, I get:

我今天的心情ﾸﾍ好。

Why does this weird character appear in the second-to-last position? Thank you for your help!

Answer 1

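Using the third-party regex module, you can use the Unicode script property \p{Han} to keep only the Chinese characters (note that this also drops the full stop 。, which is not a Han character):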

>>> text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
>>> import regex
>>> print(u''.join(regex.findall(r'\p{Han}+', text, flags=regex.UNICODE)))
我今天的心情不好
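If installing the third-party regex module is not an option, a similar filter can be sketched with the standard re module. This is only an approximation: it assumes that matching the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough, so rarer characters in the extension blocks would be missed. The names HAN_RUN and extract_han are made up for this illustration.

```python
import re

# Assumption: the basic CJK Unified Ideographs block (U+4E00..U+9FFF) is enough.
# Unlike \p{Han}, this range does not cover the CJK extension blocks.
HAN_RUN = re.compile(u'[\u4e00-\u9fff]+')


def extract_han(text):
    """Return only the characters of `text` that fall in the range above."""
    return u''.join(HAN_RUN.findall(text))


text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
print(extract_han(text))
```

As with \p{Han}, the ideographic full stop 。 (U+3002) lies outside the range and is dropped.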

Or, using unicodedata.name from the standard library, keep only the characters whose Unicode name starts with CJK:

>>> import unicodedata
>>> unicodedata.name(u'a')
'LATIN SMALL LETTER A'
>>> unicodedata.name(u'我')
'CJK UNIFIED IDEOGRAPH-6211'
>>> unicodedata.name(u'今')
'CJK UNIFIED IDEOGRAPH-4ECA'

>>> text = u'''我今天的<em class="hot">心情</em>不好。<br/> I'm feeling blue today.'''
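>>> # Pass a default to unicodedata.name so unnamed characters (e.g. '\n') do not raise ValueError
>>> print(u''.join(c for c in text if unicodedata.name(c, u'').startswith('CJK')))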
我今天的心情不好
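Neither approach above explains where the stray character came from. One plausible cause, which the question does not confirm, is that the slicing was done on the raw UTF-8 bytes rather than on decoded text: each of these Chinese characters takes three bytes in UTF-8, so a slice boundary can land in the middle of one. The sketch below (Python 3) assumes the page is UTF-8 encoded and uses a simple regular expression as a stand-in for a real HTML parser; TAG and strip_tags are hypothetical names.

```python
import re

# Assumption: the page content arrives as UTF-8 encoded bytes.
raw = u'我今天的<em class="hot">心情</em>不好。'.encode('utf-8')

# Slicing bytes can cut a multi-byte character in half:
# the first character is 3 bytes long, so 4 bytes ends inside the second one.
broken = raw[:4]
broken_text = broken.decode('utf-8', errors='replace')
print(repr(broken_text))  # the partial character becomes U+FFFD

# Simplified tag pattern; not a substitute for a real HTML parser.
TAG = re.compile(u'<[^>]+>')


def strip_tags(data, encoding='utf-8'):
    """Decode bytes to text first, then remove anything that looks like a tag."""
    return TAG.sub(u'', data.decode(encoding))


clean = strip_tags(raw)
print(clean)
```

Decoding before any slicing or substitution means every index refers to a whole character, so the punctuation survives and no character is split.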
