
Scrapy spider doesn't want to go to next page

Hi all,

I am writing a Scrapy crawler; here is my previous question about it: Scrapy: AttributeError: 'YourCrawler' object has no attribute 'parse_following_urls'.

Now I am having another problem: it doesn't want to go to the next page:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
    'https://example.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["https://example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
            'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }

        # Go back and follow the next page link in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

It works and saves the data from the links on the first page, but it fails to go to the next page, without raising any error. This is the log:

…
2017-07-08 17:17:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.com/book/?id=9780143039617>
{'book_isbn': [u'<li>Editorial: <a href="/search/avanzada/?go=1&amp;editorial=Penguin%20Books">Penguin Books</a></li>', u'<li>P\xe1ginas: 363</li>', u'<li>A\xf1o: 2206</li>', u'<li>Precio: 14.50 \u20ac</li>', u'<li>EAN: 9780143039617</li>']}
2017-07-08 17:17:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-08 17:17:25 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: bookstore_2.json
2017-07-08 17:17:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

I used this next page section in my first spider, and it was working. Any idea why this happens here?

Your pagination logic should go at the end of the parse method instead of the parse_following_urls method, because the pagination link is on the main listing page, not on the book detail pages. Also, I had to remove the scheme from allowed_domains; with the scheme included, the offsite middleware filters the requests. Last thing, note that the end of the parse method yields Request (not scrapy.Request), since you don't have the scrapy module imported. The spider looks like this:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
    'https://lacentral.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["lacentral.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Go back and follow the next page link in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }
