
Scrapy spider doesn't want to go to next page

Hi all,

I am writing a Scrapy crawler; here is my previous question about it: Scrapy: AttributeError: 'YourCrawler' object has no attribute 'parse_following_urls'.

Now I am having another problem: it doesn't want to go to the next page:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
    'https://example.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["https://example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
            'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }

        # Go back and follow the next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

It works and saves the data from the links on the first page, but it fails when trying to go to the next page, without any error. This is the log:

…
2017-07-08 17:17:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.com/book/?id=9780143039617>
{'book_isbn': [u'<li>Editorial: <a href="/search/avanzada/?go=1&amp;editorial=Penguin%20Books">Penguin Books</a></li>', u'<li>P\xe1ginas: 363</li>', u'<li>A\xf1o: 2206</li>', u'<li>Precio: 14.50 \u20ac</li>', u'<li>EAN: 9780143039617</li>']}
2017-07-08 17:17:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-08 17:17:25 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: bookstore_2.json
2017-07-08 17:17:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

I used this next-page section in my first spider, and it was working. Any idea why this happens here?
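
One way to see why the crawl stops after the first page without raising any error is to test the pagination selector in scrapy shell against the kind of page where the spider actually evaluates it, which in the code above is a book details page rather than the listing page. This is only a debugging sketch, reusing the detail URL from the log and the selector from parse_following_urls:

scrapy shell 'https://example.com/book/?id=9780143039617'
# inside the shell, check whether the pagination link exists on this page:
response.css('div#paginat ul li.next a::attr(href)').extract_first()
# if this returns None, next_page is never set and no follow-up Request is yielded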

Your pagination logic should go at the end of the parse method instead of the parse_following_urls method, as the pagination link is on the main page and not on the book details page. Also, I had to remove the scheme from allowed_domains. Last thing, note that it yields Request at the end of the parse method, as you don't have the scrapy module imported. The spider looks like this:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
    'https://lacentral.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["lacentral.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Go back and follow the next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }
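
To try the corrected spider, it can be run directly with the Scrapy command line and the scraped items exported to a JSON feed, matching the bookstore_2.json feed mentioned in the log above. This is a minimal sketch, and the file name bookstore_2_spider.py is only a placeholder for wherever the class is saved:

scrapy runspider bookstore_2_spider.py -o bookstore_2.json

If you would rather keep yield scrapy.Request(...) as in the original parse method, adding import scrapy at the top of the file also works, as noted above.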
