BeautifulSoup doesn't give me Unicode

Question

I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode. Here's a code snippet:

import urllib2
from libs.BeautifulSoup import BeautifulSoup

# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'

data = urllib2.urlopen(url).read()
print 'Encoding of fetched HTML : %s' % type(data)

soup = BeautifulSoup(data)
print 'Encoding of souped up HTML : %s' % soup.originalEncoding

table = soup.table
print type(table.renderContents())

The original data returned from the page is a string. BS shows the original encoding as ISO-8859-1. I thought that BS automatically converted everything to Unicode, so why is it that when I do this:

table = soup.table
print type(table.renderContents())

...it gives me a string object and not Unicode?

How can I get Unicode objects from BS?

I'm really, really lost with this. Any help? Thanks in advance.

Answer 1

As you may have noticed, renderContents returns (by default) a string encoded in UTF-8. If you really want a Unicode string representing the entire document, you can call unicode(soup), or decode the output of renderContents or prettify, for example unicode(soup.prettify(), "utf-8").
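The decode step in this answer can be sketched without Beautiful Soup at all. The snippet below is only an illustration under stated assumptions: `text` and `rendered` are hypothetical stand-ins for the document's internal Unicode text and for the UTF-8 byte string a render call returns, and the code is written to run on Python 3 as well as Python 2, so it uses `.decode('utf-8')` rather than the Python 2-only `unicode(...)` built-in.

```python
# Hypothetical stand-in for the Unicode text the parser holds internally.
text = u'<td>Caf\xe9 meeting</td>'

# Hypothetical stand-in for table.renderContents(): the same text,
# encoded as a UTF-8 byte string.
rendered = text.encode('utf-8')

# Decoding with the matching codec recovers the Unicode text.
decoded = rendered.decode('utf-8')

print(type(rendered))  # the byte string type
print(type(decoded))   # the Unicode text type
```

In Python 2, `unicode(rendered, 'utf-8')` does the same job as `rendered.decode('utf-8')`.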

Related

How to render contents of a tag in unicode in BeautifulSoup?

Answer 2

originalEncoding is exactly that: the source encoding. The fact that BS stores everything as Unicode internally won't change that value. When you walk the tree, all text nodes are Unicode, all tags are Unicode, and so on, unless you convert them yourself (say, by using print, str, prettify, or renderContents).

Try doing something like:

soup = BeautifulSoup(data)
print type(soup.contents[0])

Unfortunately, everything you've tried up to this point has hit one of the very few methods in BS that convert to strings.
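The three stages this answer distinguishes (source bytes, Unicode held internally, bytes produced on output) can be illustrated with plain string operations. This is a rough sketch, not Beautiful Soup code: the names `raw`, `source_encoding`, `node_text`, and `rendered` are hypothetical, and UTF-8 is assumed as the output encoding because that is the default the first answer describes.

```python
# Bytes as they might arrive from the server, encoded as ISO-8859-1.
raw = u'Caf\xe9'.encode('iso-8859-1')

# What originalEncoding would report; it describes the input only.
source_encoding = 'iso-8859-1'

# Unicode text, as a text node in the parsed tree would hold it.
node_text = raw.decode(source_encoding)

# A render-style call re-encodes the text, here as UTF-8.
rendered = node_text.encode('utf-8')

print(repr(raw))
print(repr(node_text))
print(repr(rendered))
```

The rendered bytes differ from the raw bytes even though both represent the same text, which is why the source encoding label says nothing about what the rendering methods return.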

Answer 1
4 2010-08-10 20:53:24

Answer 2
2 2010-07-07 07:20:51
