
Unable to scrape while running scrapy spider sequentially

I'm new to Scrapy and I'm trying to practice with an example. I want to run Scrapy spiders sequentially, but when I use the code from the documentation ( https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script ) to run them from a script, it doesn't work: the spider opens and closes instantly without scraping any data from the website. However, when I run the spider alone using "scrapy crawl", it works. I don't understand why the spider scrapes data when I call it alone but not when I try to run it sequentially. If someone could help me with that it would be great. Here's the code that I'm using:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy_splash import SplashRequest
from twisted.internet import defer, reactor

# `script` (the Lua source sent to Splash) and `APAitem` are defined
# elsewhere in the project and are omitted here.


class APASpider(scrapy.Spider):
    name = 'APA_test'
    allowed_domains = ['some_domain.com']
    start_urls = ['startin_url']

    def start_requests(self):
        # Render the start URLs through Splash rather than plain Requests.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='execute',
                                cache_args=['lua_source'],
                                args={'lua_source': script, 'timeout': 3600},
                                headers={'X-My-Header': 'value'},
                                )

    def parse(self, response):
        # Follow every product link to its product page.
        for href in response.xpath('//a[@class="product-link"]/@href').extract():
            yield SplashRequest(response.urljoin(href), self.parse_produits,
                                endpoint='execute',
                                cache_args=['lua_source'],
                                args={'lua_source': script, 'timeout': 3600},
                                headers={'X-My-Header': 'value'},
                                )

        # Follow the "load more" pagination link back into parse().
        for pages in response.xpath('//*[@id="loadmore"]/@href'):
            yield SplashRequest(response.urljoin(pages.extract()), self.parse,
                                endpoint='execute',
                                cache_args=['lua_source'],
                                args={'lua_source': script, 'timeout': 3600},
                                headers={'X-My-Header': 'value'},
                                )

    def parse_produits(self, response):
        Nom = response.xpath("//h1/text()").extract()
        Poids = response.xpath('//p[@class="description"]/text()').extract()
        item_APA = APAitem()
        item_APA["Titre"] = Nom
        item_APA["Poids"] = Poids
        yield item_APA


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(APASpider)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

Thank you

It's hard to tell exactly what the issue is here, since no log messages are provided in the question.

That being said, I'll still try to answer, as I ran into the same issue a while ago.

There is a known issue with scrapy_splash involving the line local last_response = entries[#entries].response in Splash Lua scripts: in some cases the history is empty, so indexing the last entry fails. I'm assuming you have that line in your script, as I did.

The workaround I used was to check that the history is not empty before taking the last entry (as suggested by GitHub user kmike). A rough sketch of that guard is shown below.
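
For reference, this is a minimal sketch of what that check can look like inside the Lua script. The surrounding splash:go / splash:wait calls and the returned fields are only assumptions based on the common scrapy_splash example script, since the original script isn't shown in the question:

function main(splash)
  splash:go(splash.args.url)
  splash:wait(1)

  local entries = splash:history()
  local http_status = nil
  local headers = nil

  -- the history can be empty in some cases (the issue described above),
  -- so only read the last entry when there is one
  if #entries > 0 then
    local last_response = entries[#entries].response
    http_status = last_response.status
    headers = last_response.headers
  end

  return {
    html = splash:html(),
    http_status = http_status,
    headers = headers,
  }
end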
