
Wrong XPath in IMDB Scrapy spider

In the spider from "IMDB scrapy get all movie data", the expression

response.xpath("//*[@class='results']/tr/td[3]")

returns an empty list. I tried changing it to:

response.xpath("//*[contains(@class,'chart full-width')]/tbody/tr")

without success.

Any help please? Thanks.

I did not have time to go through "IMDB scrapy get all movie data" thoroughly, but I got the gist of it. The problem is to get all movie data from the given site, which involves two things. The first is to go through all the pages that list the movies of a given year. The second is to get the link to each movie on those pages; from there you do your own magic.

The problem you faced is with getting the XPath for the link to each movie. This is most likely due to a change in the website's structure (I did not have time to verify what the difference may be). Anyway, the following are the XPaths you would need.


FIRST:

We take the div with class nav as a landmark and find the link with class lister-page-next next-page inside it.

response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()

This gives the link to the next page, or None on the last page (since the next-page link is not present there).
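
To see the None-on-the-last-page behaviour without running a crawl, here is a minimal sketch. It is a stand-in, not the Scrapy call itself: it uses Python's standard-library xml.etree.ElementTree instead of Scrapy's selectors, so the path starts with ".//" and the href is read with .get("href") rather than "/@href". The two markup snippets and the helper name next_page_href are made up for illustration and only mimic the structure the XPath expects.

```python
import xml.etree.ElementTree as ET

# Same landmark as the Scrapy XPath, written in ElementTree's XPath subset.
NEXT_XPATH = ".//div[@class='nav']/div/a[@class='lister-page-next next-page']"

# Hypothetical, well-formed snippets that imitate a listing page.
PAGE_WITH_NEXT = """<html><body><div class="nav"><div class="desc">
<a class="lister-page-next next-page" href="/search/title?page=2">Next</a>
</div></div></body></html>"""

LAST_PAGE = """<html><body><div class="nav"><div class="desc">
</div></div></body></html>"""


def next_page_href(markup):
    """Return the href of the next-page link, or None if there is none."""
    root = ET.fromstring(markup)
    link = root.find(NEXT_XPATH)
    return link.get("href") if link is not None else None


print(next_page_href(PAGE_WITH_NEXT))
print(next_page_href(LAST_PAGE))
```

In a spider, the None case is the natural stopping condition: only yield a request for the next page when a link was actually found.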


SECOND:

This is what the OP originally asked about.

# Get the list of containers holding the title, etc.
containers = response.xpath("//div[@class='lister-item-content']")

# From each container, extract the link to the movie page
paths = containers.xpath("h3[@class='lister-item-header']/a/@href").extract()

Now all you need to do is loop through each element of paths and request the page.
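
The extracted hrefs are site-relative paths, so each one has to be turned into an absolute URL before it can be requested. Here is a minimal sketch of that step using the standard library's urljoin. The sample paths and the helper name build_movie_urls are hypothetical. Inside a Scrapy callback, response.urljoin(path) does the same job relative to the current page.

```python
from urllib.parse import urljoin

IMDB_URL = "http://www.imdb.com"


def build_movie_urls(paths, base=IMDB_URL):
    """Turn site-relative movie paths into absolute URLs."""
    return [urljoin(base, path) for path in paths]


# Hypothetical paths, shaped like the hrefs the XPath returns.
sample_paths = ["/title/tt0000001/", "/title/tt0000002/?ref_=adv_li_tt"]

for url in build_movie_urls(sample_paths):
    # In a spider this is where you would yield scrapy.Request(url, ...).
    print(url)
```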


Thanks for your answer. I eventually used your XPath like so:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    # follow the "next page" link on each listing page
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",)),
                  callback="parse_page", follow=True),)

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year+1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract() # list of paths of movies in the current page

        # all movies in this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass # lots of parsing....

Run it with scrapy crawl imdb -a start=<start-year> -a end=<end-year>
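
Values passed with -a reach __init__ as strings (or stay None when omitted), which is why the spider converts them with int() and falls back to defaults. The sketch below pulls that logic out into a plain function so it can be checked without Scrapy. The function name year_urls is hypothetical; the URL template and the default years are the ones from the spider above.

```python
SEARCH_URL = ("http://www.imdb.com/search/title"
              "?year={year},{year}&title_type=feature&sort=num_votes,desc")


def year_urls(start=None, end=None):
    """Build one search URL per year, mirroring the spider's start_requests."""
    start_year = int(start) if start else 1874
    end_year = int(end) if end else 2017
    # range() excludes its upper bound, hence the +1 to include end_year.
    return [SEARCH_URL.format(year=year)
            for year in range(start_year, end_year + 1)]


# Same input as: scrapy crawl imdb -a start=2015 -a end=2016
for url in year_urls("2015", "2016"):
    print(url)
```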
