
Scrapy Item pipeline for multiple spiders

I have 2 spiders and run them like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

process1 = CrawlerProcess(settings)
process1.crawl('spider1')
process1.crawl('spider2')

process1.start()

and I want these spiders to write to a common file.

This is the pipeline class:

import codecs
import json
from collections import OrderedDict


class FilePipeline(object):

    def __init__(self):
        # One shared output file for every spider that goes through this pipeline.
        self.file = codecs.open('data.txt', 'w', encoding='utf-8')
        self.spiders = []

    def open_spider(self, spider):
        self.spiders.append(spider.name)

    def process_item(self, item, spider):
        # Serialize each item as one JSON line.
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        # Close the file only after the last spider has finished.
        self.spiders.remove(spider.name)
        if len(self.spiders) == 0:
            self.file.close()

But although I don't get an error message, when all spiders are done writing to the common file, it contains fewer lines (items) than the Scrapy log reports. A few lines are cut off. Maybe there is some established practice for writing to one file simultaneously from two spiders?

UPDATE:

Thanks, everybody! I implemented it this way:

import threading
import codecs
import json
from collections import OrderedDict


class FilePipeline1(object):
    # Class-level lock and file object shared by all pipeline instances,
    # so both spiders serialize their writes to the same file.
    lock = threading.Lock()
    datafile = codecs.open('myfile.txt', 'w', encoding='utf-8')

    def __init__(self):
        pass

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        try:
            FilePipeline1.lock.acquire()
            # VehicleItem is the item class defined in the project's items module.
            if isinstance(item, VehicleItem):
                FilePipeline1.datafile.write(line)
        except:
            pass
        finally:
            FilePipeline1.lock.release()

        return item

    def spider_closed(self, spider):
        pass

I agree with A. Abramov's answer.

Here is just an idea I had. You could create two tables in a DB of your choice and then merge them after both spiders are done crawling. You would have to keep track of the time the logs came in, so you can order your logs based on the time received. You could then dump the DB into whatever file type you would like. This way, the program doesn't have to wait for one process to complete before writing to the file, and you don't have to do any multithreaded programming.
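For illustration, here is a minimal sketch of that idea using SQLite; the items.db path, the per-spider table names and the dump_to_file helper are placeholders I made up, not anything Scrapy provides:

import codecs
import json
import sqlite3
import time


class SQLitePipeline(object):

    def open_spider(self, spider):
        # One table per spider; each row carries the time it was received.
        self.conn = sqlite3.connect('items.db')
        self.table = 'items_%s' % spider.name
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS %s (ts REAL, data TEXT)' % self.table)

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO %s VALUES (?, ?)' % self.table,
            (time.time(), json.dumps(dict(item), ensure_ascii=False)))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


# After process1.start() returns, merge both tables ordered by receive time.
def dump_to_file(path='data.txt'):
    conn = sqlite3.connect('items.db')
    rows = conn.execute(
        'SELECT ts, data FROM items_spider1 '
        'UNION ALL SELECT ts, data FROM items_spider2 '
        'ORDER BY ts')
    with codecs.open(path, 'w', encoding='utf-8') as f:
        for ts, data in rows:
            f.write(data + '\n')
    conn.close()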

UPDATE:

Actually, depending on how long your spiders are running, you could just store the log output and the time in a dictionary, where the times are the keys and the log output lines are the values. This would be easier than initializing a DB. You could then dump the dict into your file ordered by keys.
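A rough sketch of that dictionary variant (the module-level collected dict and the dump_collected helper are placeholders of mine; note that two items arriving with exactly the same timestamp would overwrite each other):

import codecs
import json
import time
from collections import OrderedDict

collected = {}  # receive time -> serialized item line


class DictCollectPipeline(object):

    def process_item(self, item, spider):
        # Keep the serialized line in memory, keyed by the time it arrived.
        collected[time.time()] = json.dumps(OrderedDict(item), ensure_ascii=False)
        return item


def dump_collected(path='data.txt'):
    # Call this after process1.start() returns, i.e. once all spiders are done.
    with codecs.open(path, 'w', encoding='utf-8') as f:
        for ts in sorted(collected):
            f.write(collected[ts] + '\n')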

Both of the spiders you have in separate threads write to the file simultaneously. That will lead to problems such as lines being cut off and some of them going missing if you don't take care of synchronization, as the post describes. To do that, you need to either synchronize file access and only write whole records/lines, or have a strategy for allocating regions of the file to different threads (e.g. rebuilding a file from known offsets and sizes), and by default you have neither of these. Generally, writing to the same file at the same time from two different threads is not a common approach, and unless you really know what you're doing, I don't advise you to do so.

Instead, I'd separate the spiders' IO functions and wait for one's action to finish before starting the other. Considering your threads aren't synchronized, this will both make the program more efficient and make it work :) If you want a code example of how to do this in your context, just ask for it and I'll happily provide it.
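For reference, a minimal sketch of running the two spiders one after the other, based on the sequential-crawling pattern with CrawlerRunner from the Scrapy docs (the spider names are the ones from the question); with this, the second spider only starts writing once the first one has completely finished:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())


@defer.inlineCallbacks
def crawl():
    # Each yield waits for that crawl to finish before starting the next one.
    yield runner.crawl('spider1')
    yield runner.crawl('spider2')
    reactor.stop()


crawl()
reactor.run()  # blocks until the last crawl calls reactor.stop()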
