Scrapy - run spider multiple times
I set up a crawler like this:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import json

def crawler(mood):
    process = CrawlerProcess(get_project_settings())
    # crawl music selected by critics on the web
    process.crawl('allmusic_{}_tracks'.format(mood), domain='allmusic.com')
    # the script will block here until the crawling is finished
    process.start()

    # create containers for scraped data
    allmusic = []
    allmusic_tracks = []
    allmusic_artists = []

    # process pipelined files
    with open('blogs/spiders/allmusic_data/{}_tracks.jl'.format(mood), 'r+') as t:
        for line in t:
            allmusic.append(json.loads(line))

    # fetch artists and their corresponding tracks
    for item in allmusic:
        allmusic_artists.append(item['artist'])
        allmusic_tracks.append(item['track'])

    return (allmusic_artists, allmusic_tracks)
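Each line of the `.jl` file is a standalone JSON object (JSON Lines, the format a Scrapy jsonlines feed export or pipeline typically writes). The parsing step can be sketched in isolation; the sample data below is a made-up stand-in for the real file:

```python
import io
import json

def load_jsonlines(fp):
    # Parse one JSON object per line, skipping blank lines.
    return [json.loads(line) for line in fp if line.strip()]

# Stand-in for blogs/spiders/allmusic_data/bitter_tracks.jl:
sample = io.StringIO(
    '{"artist": "Nick Drake", "track": "Pink Moon"}\n'
    '{"artist": "Elliott Smith", "track": "Angeles"}\n'
)
items = load_jsonlines(sample)
artists = [item['artist'] for item in items]
tracks = [item['track'] for item in items]
```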
I can run it like this:
artist_list, song_list = crawler('bitter')
print artist_list
and it works fine.
But if I want to run it several times in a row:
artist_list, song_list = crawler('bitter')
artist_list2, song_list2 = crawler('harsh')
I get:
twisted.internet.error.ReactorNotRestartable
Is there a simple way to set up a wrapper for this spider so that I can run it multiple times?
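The error occurs because Twisted's reactor can only be started once per process. A common workaround is to launch each blocking crawl in a throwaway child process, so every call gets a fresh reactor. A minimal sketch with `multiprocessing`; the crawl body here is a stand-in, since the real one would build a `CrawlerProcess` exactly as in the question:

```python
import multiprocessing

def _run_crawl(mood, queue):
    # In the real project this body would create a CrawlerProcess,
    # call process.start(), and read the resulting .jl file.
    # Stand-in result so the sketch is runnable without Scrapy:
    queue.put([{'artist': 'a', 'track': 'track-{}'.format(mood)}])

def crawler(mood):
    # A fresh child process per call: each one gets its own Twisted
    # reactor, so start() never runs twice in the same process.
    ctx = multiprocessing.get_context('fork')  # POSIX only
    queue = ctx.Queue()
    worker = ctx.Process(target=_run_crawl, args=(mood, queue))
    worker.start()
    items = queue.get()
    worker.join()
    artists = [item['artist'] for item in items]
    tracks = [item['track'] for item in items]
    return artists, tracks
```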
It is quite simple.
A separate process is already defined inside the function.
So I can just do this:
def crawler(mood1, mood2):
    process = CrawlerProcess(get_project_settings())
    # crawl music selected by critics on the web
    process.crawl('allmusic_{}_tracks'.format(mood1), domain='allmusic.com')
    process.crawl('allmusic_{}_tracks'.format(mood2), domain='allmusic.com')
    # the script will block here until the crawling is finished
    process.start()
provided you have a spider class defined for each mood.
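The same idea extends to any number of moods: schedule every spider on one `CrawlerProcess`, then call `start()` once. A sketch, with the process object passed in so the scheduling loop stands alone; spider names are assumed to follow the `allmusic_<mood>_tracks` convention from the question:

```python
def schedule_crawls(process, moods):
    # Queue one spider per mood on the shared CrawlerProcess;
    # nothing runs until process.start() is called once afterwards.
    for mood in moods:
        process.crawl('allmusic_{}_tracks'.format(mood), domain='allmusic.com')
    return process

# In the real project:
# process = CrawlerProcess(get_project_settings())
# schedule_crawls(process, ['bitter', 'harsh']).start()
```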