Put Data in a CSV File from a Spider Scraping Different Websites (Scrapy)

In an attempt to combine 2 different Scrapy spiders that scrape unrelated websites, I created this script. But now I can't seem to put the data into a normal CSV or JSON file. Before I combined the spiders I would just run 'scrapy crawl afg2 -o data_set.csv', but now that doesn't seem to work.

What would be the easiest way to still get the data into a CSV file? Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess


class KhaamaSpider1(scrapy.Spider):
    name = 'khaama1'
    allowed_domains = ['www.khaama.com']  # allowed_domains takes bare domains, not URL paths
    start_urls = ['https://www.khaama.com/category/afghanistan']

    def parse(self, response):
        container = response.xpath("//div[@class='post-area']")
        for x in container:
            doc = x.xpath(".//div[@class='blog-author']/descendant::node()[4]").get()
            title = x.xpath(".//div[@class='blog-title']/h3/a/text()").get()
            author = x.xpath(".//div[@class='blog-author']/a/text()").get()
            rel_url = x.xpath(".//div[@class='blog-title']/h3/a/@href").get()

            yield {
                'date_of_creation' : doc,
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }

class PajhwokSpider1(scrapy.Spider):
    name = 'pajhwok1'
    allowed_domains = ['www.pajhwok.com']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@class='node-inner clearfix']")
        for x in container:
            doc = x.xpath(".//div[@class='journlist-creation-article']/descendant::div[5]/text()").get()
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()

            yield {
                'date_of_creation' : doc,
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }
        
process = CrawlerProcess()
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()
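
For reference, the -o command-line option is just a shortcut for Scrapy's feed exports, and the same behaviour can be configured from a script by passing the FEEDS setting (available since Scrapy 2.1) to CrawlerProcess. A minimal sketch of the last block under that assumption, where %(name)s is expanded to each spider's name so the items land in khaama1.csv and pajhwok1.csv:

process = CrawlerProcess(settings={
    # FEEDS replaces the -o option when spiders are started from a script;
    # %(name)s is filled in per spider, giving khaama1.csv and pajhwok1.csv.
    "FEEDS": {
        "%(name)s.csv": {"format": "csv"},
    },
})
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()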

Usually we have one spider crawling one website, and we can save the output to a file accordingly. Since you are running two spiders that scrape different sites from a single file, one option is to write an item pipeline and store the data in files according to the spider name (name = 'pajhwok1'). Have a look at this link: https://www.tutorialspoint.com/scrapy/scrapy_item_pipeline.htm
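
As a rough sketch of that idea (illustrative only; PerSpiderCsvPipeline is a made-up name, not code from the question), a pipeline can open one CSV file per spider, named after spider.name, and write every item it receives into that file:

import csv

from itemadapter import ItemAdapter

class PerSpiderCsvPipeline:
    # Each crawler gets its own pipeline instance, so each spider writes
    # to its own file, e.g. khaama1.csv and pajhwok1.csv.
    def open_spider(self, spider):
        self.file = open(f'{spider.name}.csv', 'w', newline='')
        self.writer = None

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        row = ItemAdapter(item).asdict()
        if self.writer is None:
            # Use the first item's keys as the CSV header row.
            self.writer = csv.DictWriter(self.file, fieldnames=list(row))
            self.writer.writeheader()
        self.writer.writerow(row)
        return item

It would be registered through the ITEM_PIPELINES setting in the same way as the JsonWriterPipeline shown below.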

Here is an example pipelines.py for the 2 spiders. It will close the JSON file after the 2nd spider finishes. You can get more information here: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json

from itemadapter import ItemAdapter

# Both spiders share this pipeline class; the counter tracks how many are still
# running so the shared file is closed only after the last spider finishes.
spider_count = 2

class JsonWriterPipeline:
    # A single file handle shared by all pipeline instances (one JSON object per line).
    file = open('items.json', 'w')

    def open_spider(self, spider):
        return None

    def close_spider(self, spider):
        global spider_count
        spider_count -= 1
        if spider_count == 0:
            self.file.close()

    def process_item(self, item, spider):
        # Serialise each item as one JSON line and append it to the shared file.
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
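
To make such a pipeline take effect when the spiders are run from a single standalone script rather than a full Scrapy project, it still has to be registered via the ITEM_PIPELINES setting; one way, assuming the pipeline class is defined in the same script, is to pass it through the CrawlerProcess settings:

process = CrawlerProcess(settings={
    "ITEM_PIPELINES": {
        # Looked up by import path; __main__ works for a single-file script
        # where JsonWriterPipeline is defined above.
        "__main__.JsonWriterPipeline": 300,
    },
})
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()

In a regular Scrapy project the same mapping would go into settings.py instead.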
