Python: reading Unicode characters from HTML
I have this script, which reads the text from a web page:
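import urllib2
from BeautifulSoup import BeautifulSoup

# url is assumed to be defined earlier in the script
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
paragraphs = soup.findAll('p')
content = ""
for p in paragraphs:
    content = content + p.text + " "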
In the web page I have this string:
Möddinghofe
My script reads it as:
M&ouml;ddinghofe
How can I read it exactly as it appears on the page?
Hope this helps. These two functions convert HTML entities to Unicode and back:
# -*- coding: utf-8 -*-
# (the coding line is needed in Python 2 because of the non-ASCII comments below)
from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
Reference: Convert HTML entities to Unicode and vice versa
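For anyone on Python 3, where `urllib2` no longer exists and `cgi.escape` has been removed (as of 3.8), roughly the same round trip can be sketched with the standard library `html` module. This is only a minimal sketch and not the code from the answer above; the function names `html_entities_to_unicode` and `unicode_to_html_entities` are made up for illustration.

```python
import html


def html_entities_to_unicode(text):
    """Decode named and numeric HTML entities, e.g. '&ouml;' becomes 'ö'."""
    return html.unescape(text)


def unicode_to_html_entities(text):
    """Escape &, < and >, then replace non-ASCII characters with numeric references."""
    escaped = html.escape(text, quote=False)
    # xmlcharrefreplace turns every non-ASCII character into '&#NNN;'
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


decoded = html_entities_to_unicode("M&ouml;ddinghofe")
encoded = unicode_to_html_entities(decoded)

print(decoded)
print(encoded)
```

Note that the second function produces numeric references (such as `&#246;`) rather than named ones (such as `&ouml;`), the same as the `xmlcharrefreplace` approach in the answer.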
I suggest you take a look at the encoding section of the BeautifulSoup documentation.