Python + web scraping + Scrapy: How to get the links to all movies from an IMDb page?
I have to scrape all the movies from this IMDb page: https://www.imdb.com/list/ls055386972/.
My approach is to first scrape the href value of every anchor such as <a href="/title/tt0068646/?ref_=ttls_li_tt">, i.e. extract the /title/tt0068646/?ref_=ttls_li_tt part, and then prepend 'https://www.imdb.com' to build the movie's full URL, i.e. https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt. But whenever I run response.xpath('//h3[@class]/a[@href]').extract(), it returns the whole element, movie title included, instead of just the part I want:
[u'<a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>', u'<a href="/title/tt0108052/?ref_=ttls_li_tt">Schindler\'s List</a>', ...]
I only need the "/title/tt0068646/?ref_=ttls_li_tt" part.
How should I proceed?
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/list/ls055386972/")
soup = BeautifulSoup(page.content, 'html.parser')
# Each movie title sits in an <h3 class="lister-item-header"> element
movies = soup.find_all('h3', attrs={'class': 'lister-item-header'})
for movie in movies:
    # The first <a> inside the header carries the relative link
    print(movie.a['href'])
Output:
/title/tt0068646/?ref_=ttls_li_tt
/title/tt0108052/?ref_=ttls_li_tt
/title/tt0050083/?ref_=ttls_li_tt
/title/tt0118799/?ref_=ttls_li_tt
.
.
.
.
/title/tt0088763/?ref_=ttls_li_tt
/title/tt0266543/?ref_=ttls_li_tt
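The paths printed above are relative, while the question asks for full URLs. One way to build them is `urllib.parse.urljoin` from the standard library. This is a minimal sketch and not part of the original answer: the `BASE_URL` constant and the `to_absolute` helper are names made up for illustration, and the two sample paths are copied from the output above.

```python
from urllib.parse import urljoin

# Assumption: the site root named in the question
BASE_URL = "https://www.imdb.com"


def to_absolute(relative_links, base=BASE_URL):
    """Resolve each relative href against the base URL."""
    return [urljoin(base, link) for link in relative_links]


# Two paths copied from the sample output
sample = [
    "/title/tt0068646/?ref_=ttls_li_tt",
    "/title/tt0108052/?ref_=ttls_li_tt",
]
absolute_links = to_absolute(sample)
for url in absolute_links:
    print(url)
```

Using `urljoin` instead of plain string concatenation avoids doubled or missing slashes when the base or the path changes shape.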
I suggest using requests-html to fetch all the hyperlinks and then discard the ones that don't match your criteria. You can even use r.html.absolute_links to get absolute URLs.
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.imdb.com/list/ls055386972/')
# r.html.links is a set, so filter it with a comprehension rather than
# deleting items by index while looping
links = {link for link in r.html.links if link.startswith('/title/')}
print(links)
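If `absolute_links` is used instead of `links`, a check for the '/title/' prefix no longer applies directly, because every entry then begins with the scheme and host. One possible approach, sketched here under assumptions, is to parse each URL and test only its path. The `sample_links` set is a hand-written stand-in for `r.html.absolute_links` so the snippet runs without network access, and `keep_title_links` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def keep_title_links(links):
    """Return only the URLs whose path starts with /title/."""
    return {url for url in links if urlparse(url).path.startswith("/title/")}


# Assumption: a stand-in for r.html.absolute_links
sample_links = {
    "https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt",
    "https://www.imdb.com/list/ls055386972/",
    "https://www.imdb.com/title/tt0108052/?ref_=ttls_li_tt",
}
title_links = keep_title_links(sample_links)
print(sorted(title_links))
```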
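Here is working code, give it a try. The trailing /@href step in the XPath selects the attribute value itself rather than the whole <a> element: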
import scrapy
from scrapy import Request


class MoviesSpider(scrapy.Spider):
    name = 'movies'  # name of the spider
    allowed_domains = ['imdb.com']
    start_url = 'http://imdb.com/list/ls055386972/'

    def start_requests(self):
        yield Request(self.start_url, callback=self.parse)

    def parse(self, response):
        # events = response.xpath('//*[@property="url"]/@href').extract()
        # /@href selects the attribute value, not the whole <a> element
        links = response.xpath('//h3[@class]/a/@href').extract()
        for link in links:
            # urljoin resolves the relative link against the page URL
            absolute_url = response.urljoin(link)
            yield Request(absolute_url, callback=self.parse_movies)

        # process next page url
        # next_page_url = response.xpath('//a[text() = "Next"]/@href').extract_first()
        # absolute_next_page_url = response.urljoin(next_page_url)
        # yield Request(absolute_next_page_url)

    def parse_movies(self, response):
        title = response.xpath('//div[@class = "title_wrapper"]/h1[@class]/text()').extract_first()
        yield {
            'title': title,
        }
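The links collected from the list page still carry the ?ref_=ttls_li_tt query string seen in the question. If a clean list is wanted, one option is to pull out the title ID and rebuild the URL from it. This is a sketch and not part of the original answer: it assumes title IDs have the form "tt" followed by digits, as in the sample links, and `canonical_title_url` is a hypothetical helper.

```python
import re

# Assumption: IMDb title IDs look like "tt" followed by digits
TITLE_ID = re.compile(r"/title/(tt\d+)/")


def canonical_title_url(link):
    """Return the movie URL without its query string, or None if no ID is found."""
    match = TITLE_ID.search(link)
    if match is None:
        return None
    return f"https://www.imdb.com/title/{match.group(1)}/"


print(canonical_title_url("/title/tt0068646/?ref_=ttls_li_tt"))
```

Returning None for non-matching links lets the caller skip anything that is not a movie page.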