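Scraping web data using Python
I am trying to write code to scrape data from the IMDb Top 250 web page. The code I wrote is below. It works and gives me the expected results. The problem I am facing is the number of results it returns. When I run it on my laptop, it produces 23 results, i.e. the first 23 movies listed by IMDb. But when I run it from a friend's machine, it produces the proper 250 results. Why does this happen? What should I do to avoid it?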
from bs4 import BeautifulSoup
import requests
import sys
from StringIO import StringIO
try:
    import cPickle as pickle
except:
    import pickle
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text)
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.titleColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]
imdb = []
print(len(movies))
for index in range(0, len(movies)):
    data = {"movie": movies[index].get_text(),
            "link": links[index],
            "starCast": crew[index],
            "rating": ratings[index],
            "vote": votes[index]}
    imdb.append(data)
print(imdb)
Test run result from my laptop:
['9.21', '9.176', '9.015', '8.935', '8.914', '8.903', '8.892', '8.889', '8.877', '8.817', '8.786', '8.76', '8.737', '8.733', '8.716', '8.703', '8.7', '8.69', '8.69', '8.678', '8.658', '8.629', '8.619']
23
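Because the loop above indexes five parallel lists by `len(movies)`, a short or mismatched result can pass silently or surface as an `IndexError`. As a minimal sketch, the list lengths could be checked before the records are built. The helper name `check_columns`, the `expected` count of 250, and the sample data are assumptions for illustration and are not part of the original code:

```python
def check_columns(columns, expected=250):
    """Return a list of problems found in the scraped columns.

    `columns` maps a column name to the list scraped for it.
    `expected` is the assumed number of rows (250 for the Top 250 chart).
    """
    problems = []
    for name, values in columns.items():
        if len(values) != expected:
            problems.append('%s: got %d items, expected %d'
                            % (name, len(values), expected))
    return problems


# Hypothetical data standing in for the scraped lists
sample = {
    'movies': ['a'] * 23,
    'ratings': ['9.2'] * 23,
    'votes': ['1000'] * 250,
}
for problem in check_columns(sample):
    print(problem)
```

This only detects the short result; it does not explain or prevent it, but it makes the difference between two machines visible before any data is stored.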
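I realize this is a very old question, but I liked the idea enough to make the code work better. It now exposes more of the individual data through variables. I solved this for myself, but I thought I would share it here in the hope that it helps someone else.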
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re

# Download IMDb's Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
# Name the parser explicitly; otherwise bs4 picks whichever one is installed
soup = BeautifulSoup(response.text, 'lxml')
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]
imdb = []

# Store each item in a dictionary (data), then append it to a list (imdb)
for index in range(len(movies)):
    # Separate the cell text into 'place', 'title', 'year'
    # instead of keeping "2. The Godfather (1972)" as one string
    movie_string = movies[index].get_text()
    # Collapse whitespace; note this also strips periods inside the title
    movie = ' '.join(movie_string.split()).replace('.', '')
    # Ranks are 1-based, so measure the width of index + 1, not index
    rank_width = len(str(index + 1))
    movie_title = movie[rank_width + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    place = movie[:rank_width]
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

# Print out some info
for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'])
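The slicing in the loop above depends on the width of the rank and on removing every period from the text. As an alternative sketch, not the method used in the answer, a single regular expression can extract the three fields in one step. `ENTRY_RE` and `parse_entry` are hypothetical names, and the input is assumed to have the "2. The Godfather (1972)" form mentioned in the comments:

```python
import re

# Pattern for text such as "2. The Godfather (1972)":
# rank digits, a period, the title, then a four-digit year in parentheses.
ENTRY_RE = re.compile(r'^(\d+)\.\s+(.*?)\s+\((\d{4})\)$')


def parse_entry(text):
    """Split a chart entry into (place, title, year), or return None."""
    # Collapse newlines and repeated spaces left over from the HTML layout
    normalized = ' '.join(text.split())
    match = ENTRY_RE.match(normalized)
    if match is None:
        return None
    place, title, year = match.groups()
    return int(place), title, year


print(parse_entry('2.\n      The Godfather\n  (1972)'))
# -> (2, 'The Godfather', '1972')
print(parse_entry('100. Dr. Strangelove (1964)'))
# -> (100, 'Dr. Strangelove', '1964')
```

Returning `None` for text that does not match leaves it to the caller to skip or log unexpected rows, and periods that belong to the title are kept.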