Python web scraping: fetching a page and parsing the data

Question

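I'm a Python beginner and I'm having trouble scraping a web page and displaying specific text from it.

I know my problem is related to encoding, because I have been reading about the unicode type and I see other newcomers running into exactly the same issue.

For example, say I want to scrape www.amazon.com. This is the code I have: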

import pycurl
import cStringIO
from bs4 import BeautifulSoup

buf = cStringIO.StringIO()

curl = pycurl.Curl()
curl.setopt(curl.URL, 'http://www.amazon.com')
curl.setopt(curl.WRITEFUNCTION, buf.write)
curl.perform()

result = buf.getvalue()
result = unicode(result, "ascii", errors="ignore")
buf.close()

soup = BeautifulSoup(result)
print soup.get_text()

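This returns the Amazon web page into the result variable. However, when I try to use the BeautifulSoup get_text() method, I get this annoying error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 25790: ordinal not in range(128)

How can I make sure the entire result returned by the curl request is decoded correctly?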

Answer 1

You may want to use requests. It is simpler and cleaner, and AFAIK it avoids the encoding problems.

from bs4 import BeautifulSoup
import requests

resp = requests.get('http://www.amazon.com')

bsoup = BeautifulSoup(resp.text)
print(bsoup.get_text())

There are reasons to use cURL, but in most cases requests is the simpler choice, and from your description your case doesn't look like an exception.
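If you do stay with pycurl, one thing that may help is how the raw bytes get decoded. The snippet below is only a rough sketch in Python 3 syntax, using a made-up byte string in place of what buf.getvalue() would return and assuming the page is UTF-8 (check the real charset of the response). It shows that decoding as ASCII with errors="ignore" silently throws away characters, while decoding with the right charset keeps them:

```python
# Stand-in for the bytes buf.getvalue() would return (hypothetical sample data).
raw = "<p>Read more\u2026</p>".encode("utf-8")

# Decoding as ASCII with errors="ignore" drops every non-ASCII byte,
# so the ellipsis character (U+2026) disappears from the text.
lossy = raw.decode("ascii", errors="ignore")

# Decoding with the page's charset (assumed to be UTF-8 here) keeps it.
text = raw.decode("utf-8")

print(repr(lossy))
print(repr(text))
```

In Python 2 the equivalent call would be unicode(result, "utf-8") instead of unicode(result, "ascii", errors="ignore").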

Edit: To fix the unicode error, try explicitly encoding the string as utf-8 (per this SO question):

encoded = resp.text.encode('utf-8')
bsoup = BeautifulSoup(encoded)
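To see why explicit encoding matters, here is a small sketch (Python 3 syntax, with a made-up sample string rather than the real Amazon page) that reproduces the same kind of error. The message in the question comes from the encoding step: text containing U+2026 cannot be turned into ASCII bytes, but it can be turned into UTF-8 bytes.

```python
# Sample text containing U+2026 HORIZONTAL ELLIPSIS (hypothetical, not the real page).
text = "Loading\u2026"

# Encoding to ASCII fails, which is the same error reported in the question.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Encoding to UTF-8 succeeds and keeps the character.
utf8_bytes = text.encode("utf-8")

# errors="ignore" also "succeeds", but it drops the character instead.
ascii_lossy = text.encode("ascii", errors="ignore")

print(error_message)
print(utf8_bytes)
print(ascii_lossy)
```

So if the output is going to a terminal or file that defaults to ASCII, encoding the result of get_text() as UTF-8 before writing it is one way to avoid the exception without losing characters.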
