Scraping web data using Python

I am trying to write code to scrape data from the IMDb Top 250 web page. The code I have written is below. It works and gives me the results I intend, but the problem is the number of results it returns. When I run it on my laptop, it produces 23 results: the first 23 movies as listed by IMDb. When I run it on a friend's machine, it produces all 250 results. Why does this happen, and what should I do to avoid it?

from bs4 import BeautifulSoup
import requests
import sys
from StringIO import StringIO

try:
    import cPickle as pickle
except:
    import pickle

url = 'http://www.imdb.com/chart/top'

response = requests.get(url)
soup = BeautifulSoup(response.text)

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.titleColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

print(len(movies))

for index in range(0, len(movies)):
    data = {"movie": movies[index].get_text(),
            "link": links[index],
            "starCast": crew[index],
            "rating": ratings[index],
            "vote": votes[index]}
    imdb.append(data)

print(imdb)


Test run result from my laptop:
['9.21', '9.176', '9.015', '8.935', '8.914', '8.903', '8.892', '8.889', '8.877', '8.817', '8.786', '8.76', '8.737', '8.733', '8.716', '8.703', '8.7', '8.69', '8.69', '8.678', '8.658', '8.629', '8.619']
23
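
A possible cause worth checking (an assumption; nothing in the post confirms it): `BeautifulSoup(response.text)` names no parser, so Beautiful Soup picks the best one installed, preferring lxml, then html5lib, then Python's built-in html.parser. Different parsers can build different trees from the same markup, so two machines with different packages installed may see different numbers of matches. The sketch below only guesses which parser would be picked on the current machine; `default_parser_guess` is a hypothetical helper, not part of Beautiful Soup.

```python
import importlib.util

# Third-party parsers Beautiful Soup prefers, best first, when none is named.
PREFERRED_PARSERS = ("lxml", "html5lib")


def default_parser_guess():
    """Guess which parser Beautiful Soup would choose by default on this machine."""
    for name in PREFERRED_PARSERS:
        # find_spec returns None when the package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    # html.parser ships with Python, so it is always the fallback.
    return "html.parser"


print(default_parser_guess())
```

If the two machines print different names, naming the parser explicitly, for example `BeautifulSoup(response.text, 'lxml')`, should remove that source of difference.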

I realize this is a pretty old question, but I liked the idea enough to get the code working better. It now exposes more of the individual fields as variables. I fixed it up for myself, but thought I'd share it here in the hope that it helps someone else.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re

# Download IMDB's Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

# Store each item into dictionary (data), then put those into a list (imdb)
for index in range(0, len(movies)):
    # Separate movie into: 'place', 'title', 'year'
    # Instead of "2.       The Godfather        (1972)"
    movie_string = movies[index].get_text()
    # Collapse runs of whitespace: "2. The Godfather (1972)"
    movie = ' '.join(movie_string.split())
    # Split on the first ". " so ranks of any width and dots inside titles survive
    place, _, rest = movie.partition('. ')
    # Drop the trailing " (YYYY)", which is 7 characters long
    movie_title = rest[:-7]
    year = re.search(r'\((\d{4})\)', movie_string).group(1)
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

# Print out some info
for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'])
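
The title-splitting step can also be tried on its own, without downloading anything. This is a minimal sketch, assuming each cell's text looks like "2. The Godfather (1972)" once whitespace is collapsed; `split_ranked_title` and `RANKED_TITLE` are hypothetical names introduced for illustration, not part of the code above.

```python
import re

# Rank, a dot, the title, then a four-digit year in parentheses.
RANKED_TITLE = re.compile(r'^(\d+)\.\s+(.*)\s+\((\d{4})\)$')


def split_ranked_title(text):
    """Split a ranked title cell into (place, title, year), or return None."""
    # Collapse newlines and repeated spaces left over from the HTML layout.
    collapsed = ' '.join(text.split())
    match = RANKED_TITLE.match(collapsed)
    if match is None:
        return None
    place, title, year = match.groups()
    return int(place), title, year


print(split_ranked_title("2.       The Godfather        (1972)"))
# (2, 'The Godfather', '1972')
```

Matching the whole string with one pattern keeps dots inside titles and works for ranks of any width, and returning None makes an unexpected cell easy to detect rather than silently mis-sliced.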
