Exporting scraped data to a CSV file
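I am trying to get data from a website that requires me to follow 2 URLs before I can scrape the data.
The goal is to get an export file that looks like this: [image not included]
My code is as follows: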
import scrapy
from scrapy.item import Item, Field
from scrapy import Request


class myItems(Item):
    info1 = Field()
    info2 = Field()
    info3 = Field()
    info4 = Field()

class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        # Extracts first link
        items = []
        list1 = response.css("").extract()  # extract all info from here
        for i in list1:
            link1 = 'https:...' + str(i)
            request = Request(link1, self.parseInfo1, dont_filter=True)
            request.meta['item'] = items
            yield request
        yield items

    def parseInfo1(self, response):
        # Extracts second link
        item = myItems()
        items = response.meta['item']
        list1 = response.css("").extract()
        for i in list1:
            link1 = '' + str(i)
            request = Request(link1, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            items.append(item)
            return request

    def parseInfo2(self, response):
        # Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items

I ran the spider from the terminal with the following command:
scrapy crawl techbot
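The data I get is out of order and has gaps like this: [image not included]
For example, it scrapes the first set of data multiple times, and the rest is out of order.
If anyone could point me in the right direction to get the results in the cleaner format shown at the beginning, it would be greatly appreciated.
Thanks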
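A note on the symptoms, as a plain-Python sketch rather than Scrapy code. In the spider above, `parseInfo2` appends to the list carried in `response.meta` and then returns that whole list, and Scrapy treats every element of whatever a callback returns as output. If that reading is right, each callback re-emits all the items collected before it, which would explain the first record being exported several times. The empty `myItems()` appended in `parseInfo1` is a likely source of the blank rows. The function names and the sample pages `A`, `B`, `C` below are made up for illustration.

```python
def parse_info_shared(shared_items, value):
    """Mimics the question's parseInfo2: append to a shared list, return all of it."""
    shared_items.append({"info1": value})
    return shared_items


def parse_info_single(value):
    """Returns only the item built from this one page."""
    return [{"info1": value}]


def run(pages):
    shared_items = []
    rows_shared, rows_single = [], []
    for value in pages:
        # Every element a callback returns ends up as one exported row.
        rows_shared.extend(parse_info_shared(shared_items, value))
        rows_single.extend(parse_info_single(value))
    return rows_shared, rows_single


rows_shared, rows_single = run(["A", "B", "C"])
print([row["info1"] for row in rows_shared])  # ['A', 'A', 'B', 'A', 'B', 'C']
print([row["info1"] for row in rows_single])  # ['A', 'B', 'C']
```

In a real crawl the responses also arrive in no guaranteed order, so the repeated rows would be interleaved rather than neatly grouped as they are here.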
Solved this by handling the following of both links in one function instead of two. My spider now works as follows:
class mySpider(scrapy.Spider):
    name = 'techbot'
    start_urls = ['']

    def parse(self, response):
        # Extracts links
        items = []
        list1 = response.css("").extract()
        for i in list1:
            link1 = 'https:...' + str(i)
            # First-level links are fed back into parse
            request = Request(link1, self.parse, dont_filter=True)
            request.meta['item'] = items
            yield request
        list2 = response.css("").extract()
        for i in list2:
            link2 = '' + str(i)
            # Second-level links go to the data extraction callback
            request = Request(link2, self.parseInfo2, dont_filter=True)
            request.meta['item'] = items
            yield request
        yield items

    def parseInfo2(self, response):
        # Extracts all data
        item = myItems()
        items = response.meta['item']
        item['info1'] = response.css("").extract()
        item['info2'] = response.css("").extract()
        item['info3'] = response.css("").extract()
        item['info4'] = response.css("").extract()
        items.append(item)
        return items