
Trouble parsing HTML with Unicode characters through Beautiful Soup

Beautiful Soup doesn't seem to work properly (for me) when the HTML contains non-ASCII characters, i.e. characters whose code point is 128 or above. What decoding/encoding should be used for this?

import BeautifulSoup  # Beautiful Soup 3 module, Python 2

raw = open('index.html').read()
BeautifulSoup.BeautifulSoup(raw)

Error

...stacktrace...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8094: ordinal not in range(128)

The problem is not with parsing the file. Using the link you gave in your comment to Marco, soup = BeautifulSoup(urllib.urlopen(your_link)) works absolutely fine.

The problem only appears when you try to print the parsed data to the console: it has now been converted to Unicode, and Python will try to output it as ASCII unless you tell it otherwise. So typing print soup rather than just soup in your console will work.
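To make the cause concrete, here is a minimal sketch that does not involve Beautiful Soup at all. The string text is a made-up stand-in for parsed output containing the same character (u'\xe9') as the traceback. It only illustrates that the failure comes from encoding Unicode text as ASCII, and that choosing an encoding such as UTF-8 yourself avoids it.

```python
# Hypothetical stand-in for parsed text; u'\xe9' is the character from the traceback.
text = u"caf\xe9"

# Encoding to ASCII fails. This is what happens implicitly when the
# output stream falls back to the ASCII codec.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds and yields bytes that are safe to write out.
utf8_bytes = text.encode("utf-8")
```

Encoding the parsed text yourself before writing it out, rather than relying on the console's default, is one way to sidestep the error. Whether UTF-8 is the right choice depends on what your terminal or output file expects.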
