
wrong Xpath in IMDB spider scrapy

Here: IMDB scrapy get all movie data

response.xpath("//*[@class='results']/tr/td[3]") response.xpath( “// * [@类= '结果'] / TR / TD [3]”)

returns an empty list. I tried to change it to:

response.xpath("//*[contains(@class,'chart full-width')]/tbody/tr") response.xpath(“ // * [包含(@class,'图表全角')] / tbody / tr”)

without success.

Any help please? Thanks.

I did not have time to go through IMDB scrapy get all movie data thoroughly, but I have got the gist of it. The problem statement is to get all the movie data from the given site. It involves two things. The first is to go through all the pages that contain the list of all the movies of that year. The second is to get the link to each movie, and then you do your own magic.

The problem you faced is with getting the XPath for the link to each movie. This is most likely due to a change in the website structure (I did not have time to verify what exactly changed). Anyway, the following are the XPaths you would require.


FIRST:

We take the div with class nav as a landmark and find the a element with class lister-page-next next-page in its children.

response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()

This will give the link for the next page, or None if you are on the last page (since the next-page element is not present there).
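For illustration, here is a minimal sketch of how that selector can drive pagination; the spider name and start URL are placeholders of mine (not part of the original answer), and the urljoin step assumes the href is relative.

import scrapy

class YearPageSpider(scrapy.Spider):
    # hypothetical spider, used only to illustrate following the next-page link;
    # the name and start URL below are placeholders
    name = 'year_pages'
    start_urls = ['http://www.imdb.com/search/title?year=2016,2016&title_type=feature&sort=num_votes,desc']

    def parse(self, response):
        # ... extract the movie data you need from the current page here ...

        # extract_first() returns None on the last page, so the crawl simply stops there
        next_href = response.xpath(
            "//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href"
        ).extract_first()
        if next_href is not None:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)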


SECOND:

This is the part the OP originally asked about.

# Get the list of containers holding the title, link, etc.
containers = response.xpath("//div[@class='lister-item-content']")

# From each container, extract the relative link to the movie page
paths = containers.xpath("h3[@class='lister-item-header']/a/@href").extract()

Now all you need to do is loop over each element of paths and request the corresponding page, for example as in the sketch below.
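A minimal sketch of that loop as a standalone helper; the function name and the urljoin handling of the relative hrefs are my own assumptions, and the OP's complete spider below does the same thing inline.

import scrapy

def movie_page_requests(response, callback):
    # Yield one Request per movie listed on the current results page
    containers = response.xpath("//div[@class='lister-item-content']")
    paths = containers.xpath("h3[@class='lister-item-header']/a/@href").extract()
    for path in paths:
        # hrefs such as /title/tt.../ are relative, so join them against
        # the page URL before requesting the movie page
        yield scrapy.Request(response.urljoin(path), callback=callback)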


Thanks for your answer. I eventually used your XPath like so:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    # rule to follow the link to the next page
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",)),
                  callback="parse_page", follow= True),)

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year+1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract() # list of paths of movies in the current page

        # all movies in this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass # lots of parsing....

Run it with scrapy crawl imdb -a start=<start-year> -a end=<end-year>
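For completeness, a minimal sketch of the item the spider imports from crawler.items; only the MainPageUrl field appears in the code above, so any further fields are assumptions about what parse_movie_details would fill in.

import scrapy

class MovieItem(scrapy.Item):
    # the only field actually set in the spider above
    MainPageUrl = scrapy.Field()
    # further fields (title, rating, ...) would be added as parse_movie_details grows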
