Put Data in a CSV File from a Spider Scraping Different Websites (Scrapy)
In an attempt to combine two different Scrapy spiders that scrape unrelated websites, I created this script. But now I can't seem to get the data into a plain CSV or JSON file. Before I merged the spiders I would just run 'scrapy crawl afg2 -o data_set.csv', but that no longer seems to work.
What is the easiest way to get the data into a CSV file? Here is my code:
import scrapy
from scrapy.crawler import CrawlerProcess

class KhaamaSpider1(scrapy.Spider):
    name = 'khaama1'
    # allowed_domains should hold bare domains; a URL path here makes
    # Scrapy warn and can interfere with offsite filtering
    allowed_domains = ['www.khaama.com']
    start_urls = ['https://www.khaama.com/category/afghanistan']

    def parse(self, response):
        container = response.xpath("//div[@class='post-area']")
        for x in container:
            doc = x.xpath(".//div[@class='blog-author']/descendant::node()[4]").get()
            title = x.xpath(".//div[@class='blog-title']/h3/a/text()").get()
            author = x.xpath(".//div[@class='blog-author']/a/text()").get()
            rel_url = x.xpath(".//div[@class='blog-title']/h3/a/@href").get()
            yield {
                'date_of_creation': doc,
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
class PajhwokSpider1(scrapy.Spider):
    name = 'pajhwok1'
    allowed_domains = ['www.pajhwok.com']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@class='node-inner clearfix']")
        for x in container:
            doc = x.xpath(".//div[@class='journlist-creation-article']/descendant::div[5]/text()").get()
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'date_of_creation': doc,
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
process = CrawlerProcess()
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()
Normally we have one spider crawling one website, and we can save the output to a file accordingly. When you scrape different sites with two spiders from a single script, there are a couple of options: either pass the feed-export settings to the CrawlerProcess directly (see the sketch below), or write an item pipeline, where the spider name (name = 'pajhwok1') lets you tell the two spiders apart.
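A minimal sketch of the feed-export route, assuming Scrapy 2.1 or newer (which introduced the FEEDS setting). The -o flag belongs to the scrapy command line, so when the crawl is started from a script the equivalent settings have to be passed to CrawlerProcess yourself:

process = CrawlerProcess(settings={
    # Feed exports: every item yielded by either spider is appended
    # to the same CSV file
    "FEEDS": {
        "data_set.csv": {"format": "csv"},
    },
})
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()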
Writing a pipeline that stores the data to a file is covered in this link: https://www.tutorialspoint.com/scrapy/scrapy_item_pipeline.htm
For example, a pipelines.py shared by the 2 spiders; it closes the JSON file only after the second spider has finished. You can get more information here: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json
from itemadapter import ItemAdapter

# Number of spiders sharing this pipeline; the shared output file is
# closed only when the last of them finishes.
spider_count = 2

class JsonWriterPipeline:
    # Class-level handle, so both spiders write into the same file
    file = open('items.json', 'w')

    def open_spider(self, spider):
        return None

    def close_spider(self, spider):
        global spider_count
        spider_count -= 1
        if spider_count == 0:
            self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
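For the pipeline to take effect when both spiders are started from the same script, it also has to be registered through the ITEM_PIPELINES setting. A sketch, assuming the JsonWriterPipeline class is defined in that same script (hence the __main__ module path):

process = CrawlerProcess(settings={
    # 300 is an arbitrary priority between 0 and 1000; the module path
    # assumes the pipeline class lives in this very script
    "ITEM_PIPELINES": {"__main__.JsonWriterPipeline": 300},
})
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()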