
How to stop scrapy from running the same spider twice?

So I'm following the docs for running a spider from a script, but for some reason, after it finishes crawling, the spider runs again. I've tried adding the stop_after_crawl argument and the stop() function, but with no luck. It also gives me the error below when it attempts to run a second time.

twisted.internet.error.ReactorNotRestartable

Any help is appreciated, thanks!

The Code

import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class DocSpider(scrapy.Spider):
    """
    This is the broad scraper. The name is doc_spider and it can be invoked by making an object
    of CrawlerProcess() and then passing in the Spider class. It scrapes the websites listed in a
    csv file for their content and returns the results as a .json file.
    """

    # Name of Spider
    name = 'doc_spider'

    # File of the URL list here
    urlsList = pd.read_csv('B:\docubot\DocuBots\Model\Data\linksToScrape.csv')
    urls = []
    # Take the urls and insert them into a url list
    for url in urlsList['urls']:
        urls.append(url)

    # Scrape through all the websites in the urls list
    start_urls = urls

    # This method will parse the results and will be called automatically
    def parse(self, response):
        data = {}
        # Iterates through all <p> tags
        for content in response.xpath('/html//body//div[@class]//div[@class]//p'):
            if content:
                # Append the current url
                data['links'] = response.request.url
                # Append the texts within the <p> tags
                data['texts'] = " ".join(content.xpath('//p/text()').extract())

        yield data

    def run_crawler(self):
        settings = get_project_settings()
        settings.set('FEED_FORMAT', 'json')
        settings.set('FEED_URI', 'scrape_results.json')
        c = CrawlerProcess(settings)
        c.crawl(DocSpider)
        c.start(stop_after_crawl=True)


D = DocSpider()
D.run_crawler()

Error Terminal Output

Traceback (most recent call last):
  File "web_scraper.py", line 52, in <module>
    D.run_crawler()
  File "web_scraper.py", line 48, in run_crawler
    c.start(stop_after_crawl=True)
  File "B:\Python\lib\site-packages\scrapy\crawler.py", line 312, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "B:\Python\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

You need to move run_crawler outside of the DocSpider class:

class DocSpider(scrapy.Spider):
    .....

def run_crawler():
    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'json')
    settings.set('FEED_URI', 'scrape_results.json')
    c = CrawlerProcess(settings)
    c.crawl(DocSpider)
    c.start(stop_after_crawl=True)


run_crawler()
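As an aside: if you ever actually wanted the spider to run more than once in the same process, note that CrawlerProcess.start() can only be called once, because the underlying Twisted reactor cannot be restarted. The Scrapy docs suggest CrawlerRunner for that case; a minimal sketch, assuming DocSpider is importable from the question's module:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    # Chain the crawls on a single reactor instead of restarting it
    yield runner.crawl(DocSpider)
    yield runner.crawl(DocSpider)
    reactor.stop()

crawl()
reactor.run()  # blocks here until the last crawl finishes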

SOLUTION

Found the solution: apparently every time I imported the module, scrapy would run the spider again. So I had to specify that the spider should only run when I execute the file directly, by adding an if statement.

    def run_crawler(self):
        if __name__ == "__main__":
            settings = get_project_settings()
            settings.set('FEED_FORMAT', 'json')
            settings.set('FEED_URI', 'scrape_results.json')
            c = CrawlerProcess(settings)
            c.crawl(DocSpider)
            c.start(stop_after_crawl=True)


newProc = DocSpider()
newProc.run_crawler()
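For comparison, a common layout is to keep run_crawler as a plain module-level function and put the __main__ guard around the call site rather than inside the method, so importing the module never touches the reactor at all. A minimal sketch, assuming DocSpider is defined earlier in the same file:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_crawler():
    settings = get_project_settings()
    settings.set('FEED_FORMAT', 'json')
    settings.set('FEED_URI', 'scrape_results.json')
    c = CrawlerProcess(settings)
    c.crawl(DocSpider)
    # Blocks until the crawl finishes; may only be called once per process
    c.start(stop_after_crawl=True)


if __name__ == "__main__":
    run_crawler()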
