
Python website scraping and parsing data

I'm a Python beginner and I am having trouble scraping a webpage and displaying specific text from the page.

I believe my problem lies in the encoding: I have been reading about the unicode type and have seen other beginners run into the exact same issue.

For example, let's say I want to scrape www.amazon.com. This is the code I have:

import pycurl
import cStringIO
from bs4 import BeautifulSoup

buf = cStringIO.StringIO()

curl = pycurl.Curl()
curl.setopt(curl.URL, 'http://www.amazon.com')
curl.setopt(curl.WRITEFUNCTION, buf.write)
curl.perform()

result = buf.getvalue()
result = unicode(result, "ascii", errors="ignore")
buf.close()

soup = BeautifulSoup(result)
print soup.get_text()

This stores the Amazon web page in the result variable, but I get the following error when I try to use BeautifulSoup's get_text() method:

UnicodeEncodeError: 'ascii' codec can't encode character u'\…' in position 25790: ordinal not in range(128)

How do I correctly decode the entire contents returned by my curl request?

You might want to use requests instead; it's simpler and cleaner and, AFAIK, avoids the encoding issue.

from bs4 import BeautifulSoup
import requests

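# requests decodes the response body to text for us (resp.text)
resp = requests.get('http://www.amazon.com')

# Name the parser explicitly so bs4 does not have to guess one
bsoup = BeautifulSoup(resp.text, 'html.parser')
print(bsoup.get_text())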

There are reasons to use cURL, but requests is simpler and easier in most cases, and based on what you describe, your situation doesn't look like an exception.
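As a rough illustration of the decoding step itself (useful if you stay with pycurl, whose buffer holds raw bytes): decoding as "ascii" with errors="ignore" silently drops every non-ASCII character, so it is usually better to decode with the page's real charset. The sketch below is Python 3 and assumes the page is UTF-8 unless told otherwise; decode_body and its declared_encoding parameter are hypothetical names, not part of pycurl or requests.

```python
from typing import Optional


def decode_body(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode raw HTTP body bytes to text.

    Tries the declared charset first (assumed to come from the
    Content-Type header); falls back to UTF-8 with replacement
    characters instead of dropping bytes silently.
    """
    encoding = declared_encoding or "utf-8"
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown charset: keep going, but mark the bad bytes.
        return raw.decode("utf-8", errors="replace")


# Sample bytes standing in for what buf.getvalue() would return.
sample = "caf\u00e9".encode("utf-8")
text = decode_body(sample)
```

The resulting text can then be passed to BeautifulSoup in place of the ASCII-stripped string.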

EDIT: to resolve the unicode error, try explicitly encoding the string as utf-8 (as per this SO question):

encoded = resp.text.encode('utf-8')
bsoup = BeautifulSoup(encoded, 'html.parser')
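For context, the UnicodeEncodeError is raised when the text is converted to bytes for output with the ASCII codec, not when the page is parsed. A minimal Python 3 sketch, assuming you want UTF-8 output; write_text is a hypothetical helper, and io.BytesIO stands in for a real binary stream such as sys.stdout.buffer:

```python
import io


def write_text(stream, text, encoding="utf-8"):
    """Encode text explicitly and write the bytes to a binary stream."""
    stream.write(text.encode(encoding, errors="replace"))


text = "caf\u00e9"  # contains one non-ASCII character

# Encoding with the ASCII codec reproduces the error from the question.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding as UTF-8 works for any Unicode text.
out = io.BytesIO()
write_text(out, text)
```

Choosing the encoding explicitly at the output boundary means the result no longer depends on what the terminal or environment happens to default to.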
