Trouble parsing HTML containing Unicode characters with Beautiful Soup
Beautiful Soup doesn't seem to work properly (for me) when the HTML contains characters outside the ASCII range (code points 128 and above). What decoding/encoding should I use for this?
import BeautifulSoup  # Beautiful Soup 3 module (Python 2)

raw = open('index.html').read()
BeautifulSoup.BeautifulSoup(raw)
Error
...stacktrace...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8094: ordinal not in range(128)
The problem is not with parsing the file. Using the link you gave in your comment to Marco, doing
soup = BeautifulSoup(urllib.urlopen(your_link))
works absolutely fine.
It's only when you try to output that parsed data to the console that you get a problem, because it has now been converted to Unicode, and Python will try to output it as ASCII unless you tell it otherwise. So doing
print soup
rather than just
soup
in your console will work.
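To illustrate the encoding step behind that traceback, here is a minimal sketch that does not use Beautiful Soup at all. It assumes a short Unicode string (the name text is made up for this example) containing the same character, U+00E9, that the error message complains about. It is written to run on Python 3, whereas the snippets above are Python 2.

```python
# Hypothetical sample string containing U+00E9, the character named in the traceback.
text = u"caf\xe9"

# Encoding to ASCII fails, because U+00E9 is outside range(128).
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding to UTF-8 succeeds: U+00E9 becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")

# If the output really must be ASCII, unencodable characters can be replaced with "?".
ascii_fallback = text.encode("ascii", "replace")
```

Whether UTF-8 is the right target depends on what your console expects, so treat the choice of codec here as an assumption rather than a recommendation.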