Scraping web data using Python

I am trying to write code that scrapes data from the IMDb Top 250 web page. My code is below. It runs and returns the fields I want, but the number of results varies by machine: on my laptop it returns only 23 results (the first 23 movies as listed by IMDb), while on a friend's machine it returns all 250. Why does this happen, and how can I avoid it?

from bs4 import BeautifulSoup
import requests
import sys
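try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3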

url = 'http://www.imdb.com/chart/top'

response = requests.get(url)
soup = BeautifulSoup(response.text)

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.titleColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

print(len(movies))

for index in range(0, len(movies)):
    data = {"movie": movies[index].get_text(),
            "link": links[index],
            "starCast": crew[index],
            "rating": ratings[index],
            "vote": votes[index]}
    imdb.append(data)

print(imdb)


Test run result from my laptop:
['9.21', '9.176', '9.015', '8.935', '8.914', '8.903', '8.892', '8.889', '8.877', '8.817', '8.786', '8.76', '8.737', '8.733', '8.716', '8.703', '8.7', '8.69', '8.69', '8.678', '8.658', '8.629', '8.619']
23
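
One plausible explanation, offered as an assumption since nothing in this thread confirms it, is that the two machines have different HTML parsers installed. `BeautifulSoup(response.text)` with no parser argument uses the best parser it finds installed, and different parsers can build different trees from the same imperfect markup, so the same selector may match a different number of elements. The sketch below counts matches for one selector under each parser that is available; `count_by_parser` and the `sample` markup are hypothetical names made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_by_parser(html, selector, parsers=("html.parser", "lxml", "html5lib")):
    """Count selector matches under each installed parser; skip missing ones."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed on this machine
        counts[parser] = len(soup.select(selector))
    return counts


# Tiny stand-in for the real page; for the real page, pass response.text instead.
sample = """
<table>
  <tr><td class="titleColumn">1. <a href="/title/a/">A</a> (1994)</td></tr>
  <tr><td class="titleColumn">2. <a href="/title/b/">B</a> (1972)</td></tr>
</table>
"""

print(count_by_parser(sample, "td.titleColumn"))
```

If the counts differ between parsers on the real page, passing the parser name explicitly (for example `BeautifulSoup(response.text, 'lxml')`) on both machines should make the results consistent.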

I realize this is a pretty old question, but I liked the idea enough to get the code working better. It now exposes more of the individual fields (place, title, year) as separate values. I fixed it up for myself, but thought I'd share it here in the hope that it helps someone else.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re

# Download IMDB's Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

# Store each item in a dictionary (data), then collect those in a list (imdb)
for index in range(0, len(movies)):
    # Separate the cell text into 'place', 'title', and 'year'
    # instead of keeping "2.       The Godfather        (1972)"
    movie_string = movies[index].get_text()
    movie = ' '.join(movie_string.split())  # collapse runs of whitespace
    # Split on the first '. ' only, so multi-digit places and dots in titles survive
    place, rest = movie.split('. ', 1)
    movie_title = rest[:-7]  # drop the trailing ' (YYYY)'
    year = re.search(r'\(([^()]*)\)$', rest).group(1)
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

# Print out some info
for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'])
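
The title-cell parsing can also be pulled out into a small function so it can be checked without hitting the network. This is a minimal sketch, assuming each cell's text has the form "<place>. <title> (<year>)" as in the "2.       The Godfather        (1972)" example; `parse_title_cell` is a hypothetical helper name, not part of the code above.

```python
import re

# Assumed cell layout: "<place>. <title> (<year>)", with arbitrary whitespace.
_TITLE_CELL = re.compile(r'^(\d+)\.\s+(.+?)\s+\(([^()]*)\)$')


def parse_title_cell(text):
    """Return (place, title, year) from a title cell, or None if it does not match."""
    normalized = ' '.join(text.split())  # collapse newlines and repeated spaces
    match = _TITLE_CELL.match(normalized)
    if match is None:
        return None
    place, title, year = match.groups()
    return int(place), title, year


print(parse_title_cell("2.       The Godfather        (1972)"))
# → (2, 'The Godfather', '1972')
```

Returning `None` for unexpected text lets the caller skip or log a row instead of failing partway through the list.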
