Assigning data from Scrapy spider to a variable
I'm running a Scrapy spider from a script, and I'd like to assign the scraped data to a variable instead of outputting it to a file and then reading that file back to get the data.

At the moment the spider outputs its data to a JSON file; I then read the data back in, arrange it the way I need, and delete the spider's JSON file (mostly because I don't know how to overwrite the spider's output). This works and does what I want, but it definitely feels like brute force. Is there a more efficient way to access the spider's data without first writing the output to JSON?

Here is my code:
import json
from collections import OrderedDict
from datetime import datetime
from operator import itemgetter
from pathlib import Path

import scrapy
from scrapy.crawler import CrawlerProcess


class SpiderManager:
    def __init__(self):
        self.run_spider()
        self.compile_json_data()

    @staticmethod
    def write_json(data, filename="quote_data.json"):
        """Write data to JSON file"""
        with open(filename, "w") as f:
            json.dump(data, f, indent=4)

    @staticmethod
    def read_json(filename="quote_data.json"):
        """Get data from JSON file"""
        try:
            with open(filename) as json_file:
                data = json.load(json_file)
        except FileNotFoundError:
            data = OrderedDict()
        except ValueError:
            data = []
        return data

    @staticmethod
    def compile_json_data(spider_file="quotes_spider.json"):
        """Read the data from the spider & create an OrderedDict"""
        spider_data = SpiderManager.read_json(spider_file)
        spider_data = sorted(spider_data, key=itemgetter("dob"))
        ordered_data = OrderedDict()
        for author_quote in spider_data:
            ordered_data.update({author_quote["author"]: author_quote["quote"]})
        SpiderManager.write_json(ordered_data, filename="quotes_dict.json")
        try:
            (Path.cwd() / spider_file).unlink()
        except (FileNotFoundError, TypeError):
            pass

    def run_spider(self):
        """Run the spider"""
        process = CrawlerProcess({"FEED_FORMAT": "json",
                                  "FEED_URI": "quotes_spider.json",
                                  })
        process.crawl(MySpider)
        process.start()
class MySpider(scrapy.Spider):
    name = "quotes"
    temp_data = {}

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        quote_blocks = response.css("div.quote")
        for quote_block in quote_blocks:
            url = quote_block.css("a::attr(href)").get()
            quote = quote_block.css("span::text").get().strip()
            yield response.follow(url, self.parse_crossword,
                                  cb_kwargs=dict(quote=quote))

    def parse_crossword(self, response, quote):
        author = response.css("h3::text").get().strip()
        dob = response.css("span.author-born-date::text").get()
        dob = datetime.strptime(dob, "%B %d, %Y")
        yield {
            "author": author,
            "dob": dob,
            "quote": quote
        }
if __name__ == '__main__':
    SpiderManager()
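For illustration, the rearranging step performed by `compile_json_data` (sort the records by `dob`, then build an author-to-quote mapping in that order) can be sketched with plain in-memory data; the sample records below are made up, shaped like the spider's items:

```python
from collections import OrderedDict
from operator import itemgetter

# Hypothetical sample records shaped like the spider's yielded items
# (ISO date strings sort correctly as plain strings).
spider_data = [
    {"author": "B", "dob": "1920-01-02", "quote": "second"},
    {"author": "A", "dob": "1900-05-06", "quote": "first"},
]

# Sort by date of birth, then map author -> quote in that order.
ordered = OrderedDict(
    (rec["author"], rec["quote"])
    for rec in sorted(spider_data, key=itemgetter("dob"))
)

print(ordered)  # OrderedDict([('A', 'first'), ('B', 'second')])
```

This is the same sort-then-update logic as the method above, just without the intermediate JSON files.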
Items & ItemPipelines were what I needed to achieve this. I created a class variable in SpiderManager, then used a pipeline to append each item to that class variable. Below is my code: I added an Item class and a Pipeline class, and specified the pipeline in the CrawlerProcess settings.
from scrapy import Item, Field


class SaveItemPipeline:
    """Append item to list in SpiderManager"""
    def process_item(self, item, spider):
        SpiderManager.spider_data.append(item)
        return item


class MyItem(Item):
    author = Field()
    dob = Field()
    quote = Field()

...

class SpiderManager:
    spider_data = []
    ...

    def run_spider(self):
        """Run the spider"""
        process = CrawlerProcess({
            "ITEM_PIPELINES": {SaveItemPipeline: 100},
        })
        process.crawl(MySpider)
        process.start()
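The core of this pattern, a pipeline that appends each scraped item onto a class-level list, can be mimicked without Scrapy at all; the names below (`Manager`, `CollectPipeline`) are stand-ins for the real classes, and the loop plays the role of the crawler feeding items through the pipeline:

```python
class Manager:
    # Class-level list shared by all instances,
    # analogous to SpiderManager.spider_data.
    items = []


class CollectPipeline:
    def process_item(self, item, spider):
        Manager.items.append(item)  # stash the item in memory
        return item                 # pipelines pass the item along


pipeline = CollectPipeline()
for item in [{"author": "X", "quote": "q1"}, {"author": "Y", "quote": "q2"}]:
    pipeline.process_item(item, spider=None)

print(len(Manager.items))  # 2
```

Because the list lives on the class rather than an instance, the data is reachable after `process.start()` returns, which is what lets the script use the scraped items as a plain variable.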