
Trouble parsing HTML containing Unicode characters with Beautiful Soup

Original post 2011-10-14 14:43:07 7 1 python/ regex/ html-parsing/ beautifulsoup

Beautiful Soup doesn't seem to work properly (for me) when the HTML contains Unicode characters outside the ASCII range (code points of 128 and above). What decoding or encoding should be used for this?

import BeautifulSoup

raw = open('index.html').read()
BeautifulSoup.BeautifulSoup(raw)

Error

...stacktrace...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8094: ordinal not in range(128)

1 answer

The problem is not with parsing the file. Using the link you gave in your comment to Marco, soup = BeautifulSoup(urllib.urlopen(your_link)) works absolutely fine.

The problem only appears when you try to display the parsed data in the console: by then it has been converted to Unicode, and Python will try to output it as ASCII unless you tell it otherwise. So running print soup in your console, rather than just evaluating soup, will work.
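
To see the mechanism in isolation, here is a minimal sketch that does not use Beautiful Soup at all. It assumes a short sample string stands in for the parsed page, and the names text and to_output_bytes are illustrative only, not taken from the question. Encoding a Unicode string as ASCII fails as soon as it contains a character such as u'\xe9', while naming an encoding that can represent the character (UTF-8 here, chosen as an assumption about your terminal) succeeds.

```python
# Sample text standing in for the parsed page (hypothetical data).
# It contains U+00E9, the same character named in the error message.
text = u"caf\xe9"


def to_output_bytes(s, encoding="utf-8"):
    """Encode a Unicode string explicitly instead of relying on an ASCII default.

    Illustrative helper; the UTF-8 default is an assumption about the console.
    """
    return s.encode(encoding)


# Encoding with the ASCII codec reproduces the failure from the question.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 works, because UTF-8 can represent U+00E9.
encoded = to_output_bytes(text)

# repr() keeps the printed output ASCII-only, whatever the console encoding is.
print(ascii_failed)
print(repr(encoded))
```

If your console uses a different encoding, pass that encoding name to to_output_bytes instead of relying on the default.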