
Python web scraping code error

I am trying to learn the basics of web scraping in Python using Beautiful Soup. I found some code in a document, but when I run it I get an error. The code is:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())

for row in soup('table', {'class': 'mod-data’})[0].tbody('tr'):
  tds = row('td')
  print tds[0].string, tds[1].string

and the error is:

SyntaxError: Non-ASCII character '\xe2' in file ex.py on line 4, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Please help me solve this, and explain the line:

for row in soup('table', {'class': 'mod-data'})[0].tbody('tr'):

Most sites just give sample code without explaining how it was derived or what it means. Terms like class, tbody, etc. are a bit confusing. It would be really helpful if you could suggest a site, ebook, or any other resource.

You have a typo in this line:

soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())

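Instead of a straight single quote (') after .org, you have a typographic right single quotation mark (’). Python does not treat that character as a string delimiter, and because it is a non-ASCII character in a file with no declared encoding, Python 2 raises the SyntaxError shown above.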

It should be something like:

soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())
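Curly quotes like this are hard to spot by eye, so it can help to let Python point them out. The following is a minimal sketch, written for Python 3 and not part of the original code; find_non_ascii and sample are made-up names for illustration. It scans source text and reports every non-ASCII character with its position and Unicode name. The UTF-8 encoding of the curly quote starts with the byte 0xE2, which is the '\xe2' named in the error message.

```python
import unicodedata


def find_non_ascii(source):
    """Return (line_no, column, char, unicode_name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Stand-in for the contents of ex.py. The curly quote after .org is written
# as the escape \u2019 so that this example itself stays pure ASCII.
sample = (
    "import urllib2\n"
    "soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org\u2019).read())\n"
)

for line_no, col, ch, name in find_non_ascii(sample):
    print("line %d, column %d: %r (%s)" % (line_no, col, ch, name))

# The first UTF-8 byte of the curly quote is the '\xe2' from the error message.
print("\u2019".encode("utf-8"))
```

To check a real file, the same function could be called on the text returned by open("ex.py", encoding="utf-8").read().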

Also:

You have the same issue in the following line. After mod-data, change the curly quote to a straight quote.

Once the quotes are fixed, soup('table', {'class': 'mod-data'})[0].tbody('tr') is valid, because calling a soup or tag object directly is a shortcut for find_all. The explicit form is easier to read, though:

soup.find_all('table', {'class': 'mod-data'})[0].tbody('tr')

or .findAll in older versions of BeautifulSoup (version 3).

.find_all() returns a list of matching tags, which is why [0] is used to pick the first table.

Read the BeautifulSoup docs and install the latest version (4) of BeautifulSoup.

The following code works for me:

import urllib2
from bs4 import BeautifulSoup # latest version bs4

soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())

for row in soup.find_all("table", {"class": "mod-data"})[0].tbody("tr"):
    tds = row("td")
    print tds[0].string, tds[1].string

Output:

1 Florida State
2 Auburn
3 Alabama
4 Michigan State
5 Stanford
6 Baylor
7 Ohio State
8 Missouri
9 South Carolina
10 Oregon
11 Oklahoma
12 Clemson
13 Oklahoma State
14 Arizona State
15 UCF
16 LSU
17 UCLA
18 Louisville
19 Wisconsin
20 Fresno State
21 Texas A&M
22 Georgia
23 Northern Illinois
24 Duke
25 USC

If you are having problems with the single quotes on those lines, use double quotes.

Try changing your fourth line from:

soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())

To:

soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())

It looks like your closing single quote was a different character from the opening one, so retyping the quotes (for example, as double quotes) should fix that error.

The code you are asking about reads from an HTML table. In HTML, each row of a table is denoted by the <tr> tag and each cell in a row by the <td> tag. Your program finds the first table whose class attribute is mod-data, loops over the <tr> rows inside its <tbody>, and prints the first and second cells (columns) of each row.
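To make the chained expression easier to follow, here is a rough sketch that splits it into separate steps. It assumes Python 3 with bs4 installed, and it parses a small made-up HTML snippet instead of downloading the real page, so the markup here is only a guess at what the site's structure looked like. The variable names (tables, first_table, rows, results) are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page; only the structure matters.
html = """
<table class="mod-data">
  <tbody>
    <tr><td>1</td><td>Florida State</td></tr>
    <tr><td>2</td><td>Auburn</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: every <table> whose class attribute is "mod-data" (a list of tags).
tables = soup.find_all("table", {"class": "mod-data"})

# Calling the soup object directly is shorthand for find_all.
assert tables == soup("table", {"class": "mod-data"})

# Step 2: [0] picks the first matching table.
first_table = tables[0]

# Step 3: .tbody is the <tbody> element inside it; collect its <tr> rows.
rows = first_table.tbody.find_all("tr")

# Step 4: for each row, collect the <td> cells and print the first two.
results = []
for row in rows:
    tds = row.find_all("td")
    results.append((tds[0].string, tds[1].string))
    print(tds[0].string, tds[1].string)
```

Each step corresponds to one piece of soup('table', {'class': 'mod-data'})[0].tbody('tr'), read from left to right.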

Also try changing your second line to:

from bs4 import BeautifulSoup
