How can I implement Scrapy Pause/Resume when scraping from multiple pages per item into one CSV file?
I've successfully implemented pause/resume in Scrapy with help from the documentation ( https://doc.scrapy.org/en/latest/topics/jobs.html ). I can also scrape multiple pages to fill the values of one item into a single CSV line, by adapting an example ( How can i use multiple requests and pass items in between them in scrapy python ). However, I can't seem to get both functionalities to work together, so that I have a spider that scrapes two pages for each item and is capable of being paused and resumed.
Here is my attempt, with www.beeradvocate.com as an example. urls_collection1 and urls_collection2 are lists of >40,000 URLs each.
Initiate
def start_requests(self):
    urls_collection1 = pd.read_csv('urls_collection1.csv')
    # example URL in urls_collection1: 'https://www.beeradvocate.com/community/members/sammy.3853/?card=1'
    urls_collection2 = pd.read_csv('urls_collection2.csv')
    # example URL in urls_collection2: 'https://www.beeradvocate.com/user/beers/?ba=Sammy'
    for i in range(len(urls_collection1)):
        item = MyItem()  # MyItem: your scrapy.Item subclass (`item = item()` would fail)
        yield scrapy.Request(urls_collection1.iloc[i, 0], callback=self.parse1, meta={'item': item})
        yield scrapy.Request(urls_collection2.iloc[i, 0], callback=self.parse2, meta={'item': item})
        # To allow for pause/resume
        self.state['items_count'] = self.state.get('items_count', 0) + 1
Parse from first page
def parse1(self, response):
    item = response.meta['item']
    item['gender_age'] = response.css('.userTitleBlurb .userBlurb').xpath('text()').extract_first()
    yield item
Parse from second page
def parse2(self, response):
    item = response.meta['item']
    item['num_reviews'] = response.xpath('//*[@id="ba-content"]/div/b/text()[2]').extract_first()
    return item
Everything seems to work fine, except that the data scraped via parse1 and parse2 end up on different rows instead of on the same row as one item.
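The split-row symptom can be reproduced without Scrapy at all: the CSV feed exporter writes one row per item it receives, so when both callbacks yield, two partial rows come out, and chaining the callbacks so only the last one yields produces a single complete row. A minimal plain-Python sketch (hypothetical field values, plain dicts standing in for items):

```python
# Plain-dict sketch of why the spider writes two rows per user:
# the CSV feed exporter emits one row per yielded item.

def two_callbacks_yielding():
    # Mimics the original spider: parse1 yields the item, parse2 yields again.
    yield {'gender_age': 'Male, 30', 'num_reviews': None}   # row from parse1
    yield {'gender_age': None, 'num_reviews': '42'}         # row from parse2

def chained_callbacks():
    # Mimics the chained version: parse1 only fills its field and hands the
    # item on; parse2 yields once with both fields set.
    item = {'gender_age': 'Male, 30'}  # filled in parse1
    item['num_reviews'] = '42'         # filled in parse2
    yield item

print(len(list(two_callbacks_yielding())))  # 2 (two partial rows)
print(len(list(chained_callbacks())))       # 1 (one complete row)
```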
Try this:
def start_requests(self):
    urls_collection1 = pd.read_csv('urls_collection1.csv')
    # example URL in urls_collection1: 'https://www.beeradvocate.com/community/members/sammy.3853/?card=1'
    urls_collection2 = pd.read_csv('urls_collection2.csv')
    # example URL in urls_collection2: 'https://www.beeradvocate.com/user/beers/?ba=Sammy'
    for i in range(len(urls_collection1)):
        item = MyItem()  # MyItem: your scrapy.Item subclass
        # Request only the first page here; pass the second page's URL
        # along in meta so parse1 can chain to it.
        yield scrapy.Request(urls_collection1.iloc[i, 0],
                             callback=self.parse1,
                             meta={'item': item,
                                   'collection2_url': urls_collection2.iloc[i, 0]})

def parse1(self, response):
    collection2_url = response.meta['collection2_url']
    item = response.meta['item']
    item['gender_age'] = response.css('.userTitleBlurb .userBlurb').xpath('text()').extract_first()
    # Chain to the second page instead of yielding the half-filled item.
    yield scrapy.Request(collection2_url,
                         callback=self.parse2,
                         meta={'item': item})

def parse2(self, response):
    item = response.meta['item']
    item['num_reviews'] = response.xpath('//*[@id="ba-content"]/div/b/text()[2]').extract_first()
    return item
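With the requests chained this way, the pause/resume mechanism from the jobs documentation should work unchanged: run the spider with a JOBDIR setting so the scheduler queue and self.state are persisted across runs. A sketch of the invocation, assuming the spider is named beers (a hypothetical name):

```shell
# Start the crawl; stop it gracefully with a single Ctrl-C, then resume
# by re-running the same command. JOBDIR persists the pending request
# queue and the spider's self.state dict between runs.
scrapy crawl beers -o beer_reviews.csv -s JOBDIR=crawls/beers-run-1
```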