Scraping different websites with spiders (Scrapy) and putting the data into a CSV file

Question

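To combine 2 different Scrapy spiders that scrape unrelated websites, I created this script. But now I can't seem to get the data into a regular csv or json file. Before I combined the spiders, I would just run 'scrapy crawl afg2 -o data_set.csv', but that no longer seems to work.

What is the easiest way to get the data into a csv file? Here is my code: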

import scrapy
from scrapy.crawler import CrawlerProcess


class KhaamaSpider1(scrapy.Spider):
    name = 'khaama1'
    # allowed_domains takes bare domain names, not URLs or paths.
    allowed_domains = ['www.khaama.com']
    start_urls = ['https://www.khaama.com/category/afghanistan']

    def parse(self, response):
        container = response.xpath("//div[@class='post-area']")
        for x in container:
            doc = x.xpath(".//div[@class='blog-author']/descendant::node()[4]").get()
            title = x.xpath(".//div[@class='blog-title']/h3/a/text()").get()
            author = x.xpath(".//div[@class='blog-author']/a/text()").get()
            rel_url = x.xpath(".//div[@class='blog-title']/h3/a/@href").get()

            yield{
                'date_of_creation' : doc,
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }

class PajhwokSpider1(scrapy.Spider):
    name = 'pajhwok1'
    allowed_domains = ['www.pajhwok.com']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@class='node-inner clearfix']")
        for x in container:
            doc = x.xpath(".//div[@class='journlist-creation-article']/descendant::div[5]/text()").get()
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()

            yield{
                'date_of_creation' : doc,
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }
        
process = CrawlerProcess()
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()

Answer 1

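Normally there is one spider per website, and its output can be saved to a file directly. When you run two spiders for different sites from a single file, one option is to write an item pipeline that checks the spider name (for example name = 'pajhwok1') and stores the data in a file accordingly. See https://www.tutorialspoint.com/scrapy/scrapy_item_pipeline.htm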
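As a rough illustration of the pipeline idea above, here is a minimal sketch that writes each spider's items to a CSV file named after the spider. The class name PerSpiderCsvPipeline and the output file names are made up for this example, and the column list assumes the four keys yielded by the spiders in the question.

```python
import csv


class PerSpiderCsvPipeline:
    """Write each spider's items to its own CSV file, named after the spider."""

    # Column order; assumed to match the keys yielded by the spiders.
    fields = ['date_of_creation', 'title', 'author', 'rel_url']

    def open_spider(self, spider):
        # Each crawler gets its own pipeline instance, so every spider
        # ends up with its own file handle and writer.
        self.file = open(f'{spider.name}.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.DictWriter(
            self.file, fieldnames=self.fields, extrasaction='ignore'
        )
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # The spiders yield plain dicts; missing or None values become
        # empty cells in the CSV.
        self.writer.writerow(dict(item))
        return item
```

To use it, the class has to be listed in the ITEM_PIPELINES setting, for instance through the settings dict passed to CrawlerProcess. If one file per spider is acceptable, the FEEDS setting (Scrapy 2.1 and later) may be simpler still, since its output URI accepts a %(name)s placeholder for the spider name.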

Answer 2

Here is an example pipelines.py for 2 spiders. It closes the json file after the second spider finishes. You can find more information at https://docs.scrapy.org/en/latest/topics/item-pipeline.html

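import json

from itemadapter import ItemAdapter

# Number of spiders that share the output file.
spider_count = 2


class JsonWriterPipeline:
    # Opened once, when the class is defined, and shared by every
    # pipeline instance (Scrapy creates one instance per crawler).
    file = open('items.json', 'w')

    def open_spider(self, spider):
        # Nothing to do: the file is already open.
        pass

    def close_spider(self, spider):
        global spider_count
        spider_count -= 1
        # Close the file only after the last spider has finished.
        if spider_count == 0:
            self.file.close()

    def process_item(self, item, spider):
        # Write each item as one JSON object per line (JSON Lines).
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item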
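
Since the question asks for CSV rather than JSON, the same idea might be adapted roughly as follows. This is only a sketch: the class name SharedCsvPipeline, the file name data_set.csv, and the extra 'spider' column are choices made for this example. It also assumes that every spider is opened before any of them closes, as should be the case when all crawls are scheduled before process.start(); otherwise the file would be reopened and truncated.

```python
import csv


class SharedCsvPipeline:
    """Collect the items of every spider in the process in one CSV file."""

    # 'spider' is an extra column added here to record each row's origin.
    fields = ['spider', 'date_of_creation', 'title', 'author', 'rel_url']

    # Class-level state, shared by the pipeline instances of all crawlers.
    _file = None
    _writer = None
    _open_spiders = 0

    def open_spider(self, spider):
        cls = type(self)
        if cls._open_spiders == 0:
            # First spider to start: create the file, write the header once.
            cls._file = open('data_set.csv', 'w', newline='', encoding='utf-8')
            cls._writer = csv.DictWriter(
                cls._file, fieldnames=cls.fields, extrasaction='ignore'
            )
            cls._writer.writeheader()
        cls._open_spiders += 1

    def close_spider(self, spider):
        cls = type(self)
        cls._open_spiders -= 1
        # Close the file only after the last spider has finished.
        if cls._open_spiders == 0:
            cls._file.close()

    def process_item(self, item, spider):
        row = dict(item)
        row['spider'] = spider.name
        type(self)._writer.writerow(row)
        return item
```

Counting open spiders instead of hard-coding 2 means the pipeline does not need to be edited when a third spider is added.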

2 answers

Answer 1: 0, accepted, 2020-07-23 12:32:47

Answer 2: 0, 2020-07-23 12:40:49
