
Getting data from multiple links using scrapy

I am new to Scrapy and Python. I am trying to retrieve data from https://in.bookmyshow.com/movies, since I need the information for all the movies. But something is wrong with my code, and I would like to know where I have gone wrong.

rules = (Rule(SgmlLinkExtractor(allow=(r'https://in\.bookmyshow\.com/movies/.*',)), callback="parse_items", follow=True),)


def parse_items(self, response):
    for sel in response.xpath('//div[contains(@class, "movie-card")]'):
        item = Ex1Item()
        item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
        item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
        item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
        item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
        item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
        yield item

Your code seems to be fine. Perhaps the problem is outside of the part you posted here.

This worked for me:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class BookmyshowSpider(CrawlSpider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']
    rules = (Rule(SgmlLinkExtractor(allow=(r'https://in\.bookmyshow\.com/movies/.*',)), callback="parse_items", follow=True),)

    def parse_items(self, response):
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = Ex1Item()  # Ex1Item is your item class; import it from your project's items module
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item
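Note that SgmlLinkExtractor lives in the old scrapy.contrib package and has been removed from recent Scrapy releases, where LinkExtractor from scrapy.linkextractors is the replacement. Independently of Scrapy, you can sanity-check the allow pattern with plain re before running a crawl; a quick sketch using made-up example URLs:

```python
import re

# The same pattern passed to the link extractor's allow= argument.
pattern = re.compile(r'https://in\.bookmyshow\.com/movies/.*')

# Made-up example URLs, just to illustrate which links the rule would follow.
urls = [
    'https://in.bookmyshow.com/movies/some-movie/ET00012345',  # matched
    'https://in.bookmyshow.com/events/some-concert',           # not matched
    'https://in.bookmyshow.com/movies',                        # not matched: no trailing slash
]

matched = [u for u in urls if pattern.match(u)]
print(matched)
```

Because the pattern requires a `/` after `movies`, the listing page itself is not re-crawled, only the individual movie pages.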

EDIT: Version using the standard spider class scrapy.Spider:

import scrapy

class BookmyshowSpider(scrapy.Spider):
    name = "bookmyshow"
    start_urls = ['https://in.bookmyshow.com/movies']
    allowed_domains = ['bookmyshow.com']

    def parse(self, response):
        links = response.xpath('//a/@href').re(r'movies/[^/]+/.*$')
        for url in set(links):
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_movie)

    def parse_movie(self, response):
        for sel in response.xpath('//div[contains(@class, "movie-card")]'):
            item = {}
            item['Moviename'] = sel.xpath('.//a[@class="__movie-name"]/text()').extract()
            item['Language'] = sel.xpath('/html/body/div[1]/div[2]/div/div[1]/div[2]/section[1]/div/div[2]/div[1]/div[1]/div/div/div[2]/div[2]/ul/li/text()').extract()
            item['Info'] = sel.xpath('.//div[@class="__rounded-box __genre"]/text()').extract()
            item['Synopsis'] = sel.xpath('/html/body/div[1]/div[2]/div[1]/div[2]/div[4]/div[2]/div[2]/blockquote/text()').extract()
            item['Release'] = sel.xpath('.//span[@class="__release-date"]/text()').extract()
            yield item

parse() extracts the links to movie pages from the start page. parse_movie() is used as the callback for all requests to the individual movie pages. With this version you certainly have more control over the spider's behavior.
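The link-collection step in parse() can be reproduced outside Scrapy: response.urljoin() is a thin wrapper around urllib.parse.urljoin, and set() removes duplicate hrefs so each movie page is requested only once. A minimal sketch with hypothetical relative links:

```python
import re
from urllib.parse import urljoin

base = 'https://in.bookmyshow.com/movies'

# Hypothetical hrefs as they might appear on the listing page.
hrefs = [
    '/movies/some-movie/ET00012345',
    '/movies/some-movie/ET00012345',    # duplicate, dropped by set()
    '/movies/another-movie/ET00067890',
    '/offers/weekend-deal',             # filtered out by the regex
]

# Same filter as the spider's .re(r'movies/[^/]+/.*$')
links = [h for h in hrefs if re.search(r'movies/[^/]+/.*$', h)]

for url in sorted(set(links)):
    print(urljoin(base, url))
```

Each resulting absolute URL is what scrapy.Request() would be given in the loop.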
