I am trying to learn the basics of web scraping in Python using Beautiful Soup. I came across this code in a document, but executing it raises an error. The code is:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())
for row in soup('table', {'class': 'mod-data’})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
and the error is:
SyntaxError: Non-ASCII character '\xe2' in file ex.py on line 4, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Please help me solve this, and explain the line:
for row in soup('table', {'class': 'mod-data'})[0].tbody('tr'):
Most sites give sample code without explaining how it was derived or what it means. Terms like class, tbody, etc. are a bit confusing. It would be really helpful if you could suggest any site, ebook, or other resource.
You have a typo in this line:
soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())
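Instead of a straight single quote (') after .org, you have a typographic right single quotation mark (’, U+2019). Its UTF-8 encoding starts with the byte \xe2, which is the non-ASCII character the error message reports.
It should be something like: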
soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())
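Typographic quotes are hard to spot by eye, especially in code pasted from a PDF or word processor. Here is a minimal sketch of how one might find and replace them. It is written for Python 3 (unlike the Python 2 code in this thread), and find_non_ascii and straighten_quotes are hypothetical helper names, not part of any library:

```python
# Hypothetical helpers (not from the original post): locate non-ASCII
# characters in source text and replace typographic quotes with ASCII ones.

def find_non_ascii(source):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Typographic quotes mapped to their plain ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def straighten_quotes(source):
    """Replace typographic quotes with plain ASCII quotes."""
    return source.translate(str.maketrans(SMART_QUOTES))


# The closing quote below is U+2019, as in the failing line of the question.
bad_line = "url = 'http://www.bcsfootball.org\u2019"
hits = find_non_ascii(bad_line)
fixed = straighten_quotes(bad_line)
print(hits)
print(fixed)
```

To check a whole script, one could read the file as UTF-8 text and pass its contents to find_non_ascii.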
Also:
You have the same issue in the following line: after mod-data, change the typographic quote to a straight quote.
Instead of just soup('table', {'class': 'mod-data'})[0].tbody('tr'), try soup.find_all('table', {'class': 'mod-data'})[0].tbody('tr') (or .findAll for older versions of BeautifulSoup).
Calling soup(...) directly is a shortcut for .find_all(), so both return a list of matches, but naming the method explicitly makes the code easier to read.
Read the BeautifulSoup docs and get the latest version (4) of BeautifulSoup.
The following code works for me:
import urllib2
from bs4 import BeautifulSoup # latest version bs4
soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())
for row in soup.find_all("table", {"class": "mod-data"})[0].tbody("tr"):
    tds = row("td")
    print tds[0].string, tds[1].string
Output:
1 Florida State
2 Auburn
3 Alabama
4 Michigan State
5 Stanford
6 Baylor
7 Ohio State
8 Missouri
9 South Carolina
10 Oregon
11 Oklahoma
12 Clemson
13 Oklahoma State
14 Arizona State
15 UCF
16 LSU
17 UCLA
18 Louisville
19 Wisconsin
20 Fresno State
21 Texas A&M
22 Georgia
23 Northern Illinois
24 Duke
25 USC
If you are having problems using single-quotes on those lines, use double-quotes.
Try changing your fourth line from:
soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())
To:
soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())
It looks like your second single quote was different from the first, so changing to double quotes should alleviate that error.
The code you are asking about reads from a table. In HTML, each row of a table is denoted by the <tr> tag, which your program searches for and then reads from. You are then printing the first and second columns (the <td> cells) of each row in the table you found.
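To see what each part of that for line does, it can help to split it into steps. The sketch below assumes bs4 is installed and uses a small made-up HTML string in place of the real page. It is written for Python 3, and the intermediate variable names are only for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (an assumption for illustration).
html = """
<table class="other"><tbody><tr><td>x</td><td>y</td></tr></tbody></table>
<table class="mod-data">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Florida State</td></tr>
    <tr><td>2</td><td>Auburn</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Find every <table> whose class attribute is "mod-data".
#    "class" here is the HTML attribute, not a Python class.
tables = soup.find_all("table", {"class": "mod-data"})

# 2. [0] takes the first matching table.
table = tables[0]

# 3. .tbody is the table's <tbody> element, which holds the data rows
#    (the header row sits in <thead> and is skipped).
body = table.tbody

# 4. body("tr") is shorthand for body.find_all("tr"): all rows in the body.
rows = body.find_all("tr")

results = []
for row in rows:
    tds = row.find_all("td")  # the cells (columns) of this row
    results.append((tds[0].string, tds[1].string))
    print(tds[0].string, tds[1].string)
```

Chaining these steps back together gives the original one-liner: soup.find_all('table', {'class': 'mod-data'})[0].tbody('tr').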
Also try changing your second line to:
from bs4 import BeautifulSoup