
Trouble scraping with BeautifulSoup

I'm trying to do some scraping and I'm stuck on a basic problem (I guess?)

Here's my script so far:

from requests import get
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'

response = get(url)

soup = BeautifulSoup(response.text, 'html.parser')


movies_containers = soup.find_all('div', class_ = 'lister-item mode-advanced')

names = []
years = []
imdb_ratings = []
metascores = []
votes = []
#gross=[] #many movies have no record
movie_description=[]
movie_duration=[]
movie_genre=[]


for container in movies_containers:
    if container.find_all('div', class_ = 'ratings-metascore') is not None:

        name = container.find('h3', class_ = 'lister-item-header').a.text
        names.append(name)

        year = container.h3.find('span', class_ = 'lister-item-year text-muted unbold').text
        year = year.replace('(', ' ')
        year = year.replace(')', ' ')
        years.append(year)

        imdb_rating = float(container.find('div', class_ = 'inline-block ratings-imdb-rating').text)
        imdb_ratings.append(imdb_rating)

        score = container.find('span', class_ = 'metascore').text
        metascores.append(score)

And I got this error:

AttributeError: 'NoneType' object has no attribute 'text'

I don't understand why this line of code doesn't work.

When I remove .text:

score = container.find('span', class_ = 'metascore')

It gives me this:

<span class="metascore favorable">77        </span>

Any ideas?

Thanks

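For some containers, find('span', class_='metascore') returns None because that movie has no Metascore, and calling .text on None raises the AttributeError. Check the result before reading its text: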

import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')
movies_containers = soup.find_all('div', class_='lister-item mode-advanced')

names = []
years = []
imdb_ratings = []
metascores = []
votes = []
movie_description = []
movie_duration = []
movie_genre = []

for container in movies_containers:
    # find_all() returns a list (possibly empty), never None, so test find() instead
    if container.find('div', class_='ratings-metascore') is not None:
        name = container.find('h3', class_='lister-item-header').a.text
        names.append(name)

        year = container.h3.find('span', class_='lister-item-year text-muted unbold').text
        years.append(year.replace('(', ' ').replace(')', ' '))

        imdb_rating = float(container.find('div', class_='inline-block ratings-imdb-rating').text)
        imdb_ratings.append(imdb_rating)

        score = container.find('span', class_='metascore')
        if score:
            metascores.append(score.getText(strip=True))
print(metascores)

Output:

['77', '74', '67', '84', '94', '76', '73', '85', '69', '81', '86', '88', '45', '81', '87', '75', '58', '65', '44', '62', '39', '65', '94', '48', '82', '52', '54', '93', '56', '73', '52', '41', '75', '47', '77', '63', '34', '75', '29', '51', '37', '65']
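
One more thing worth checking: the guard in the question, container.find_all(...) is not None, always passes, because find_all() returns a list (empty when nothing matches), while find() returns None. Skipping only the missing scores also leaves metascores shorter than names. Here is a minimal sketch of one way to keep the lists aligned by storing None as a placeholder. The HTML is a made-up, trimmed-down stand-in for the IMDb markup, and "Movie A" / "Movie B" are hypothetical titles.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the containers in the question;
# the second movie deliberately has no Metascore.
html = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a>Movie A</a></h3>
  <div class="inline-block ratings-metascore">
    <span class="metascore favorable">77        </span>
  </div>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header"><a>Movie B</a></h3>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
containers = soup.find_all('div', class_='lister-item mode-advanced')

# find_all() gives an empty list when nothing matches, so "is not None" is always True.
no_match = containers[1].find_all('div', class_='ratings-metascore')
guard_always_passes = no_match is not None

names = []
metascores = []
for container in containers:
    names.append(container.find('h3', class_='lister-item-header').a.text)

    # find() returns None when the span is absent, so check before reading its text.
    score = container.find('span', class_='metascore')
    if score is not None:
        metascores.append(int(score.get_text(strip=True)))
    else:
        metascores.append(None)  # placeholder keeps metascores aligned with names

print(guard_always_passes, len(no_match))
print(names)
print(metascores)
```

With the placeholder in place, names[i] and metascores[i] always refer to the same movie, which matters if the lists are later combined into a table.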


 