
When a new file arrives in S3, trigger luigi task

I have a bucket with new objects getting added at random intervals with keys based on their time of creation. For example:

's3://my-bucket/mass/%s/%s/%s/%s/%s_%s.csv' % (time.strftime('%Y'), time.strftime('%m'), time.strftime('%d'), time.strftime('%H'), name, the_time)

In fact, these are the outputs of Scrapy crawls. I want to trigger a task that matches these crawls to a master .csv product catalog file I have (call it "product_catalog.csv"), which also gets updated regularly.

Right now, I have several Python scripts I have written with global variables that I fill in every time I run this process. Those need to become imported attributes.

So here is what needs to happen:

1) A new csv file shows up in "s3://my-bucket/mass/..." with a unique key name based on the time the crawl completed. Luigi sees this and begins.
2) "cleaning.py" gets run by luigi on the new file, so the parameter for "cleaning.py" (the file that showed up in S3) needs to be supplied to it at runtime. The results get saved in S3 in addition to being passed on to the next step.
3) The latest version of "product_catalog.csv" is pulled from a database, and "matching.py" uses it together with the results of "cleaning.py".

I realize this may not make complete sense. I will supply edits as necessary to make it all more clear.

EDIT

Based on initial answers, I have configured this to be a pull operation that saves steps along the way. But now I am pretty lost. It should be noted that this is my first time tying a Python project together, so there are things like including __init__.py that I am learning as I go. As usual, it is a bumpy road of excitement at each success, followed immediately by confusion at the next roadblock.

Here are my questions:
1) How to import the spiders from Scrapy is unclear to me. I have about a dozen of them, and the goal is to have luigi manage the crawl > clean > match process for all of them. The Scrapy documentation says to include:

class MySpider(scrapy.Spider):
    # Your spider definition

What does that mean? Re-write the spider in the script controlling the spider? That makes no sense and their examples are not helpful.
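For reference, the docs snippet only means that the spider class has to be available to the script that drives the crawl; nothing has to be re-written. A minimal sketch of running a spider from a plain script with CrawlerProcess, assuming the project layout implied by the imports further down (a my_crawlers/my_crawlers/spiders/my_spider.py module defining a MySpider class is an assumption):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from my_crawlers.my_crawlers.spiders.my_spider import MySpider  # assumed module path

process = CrawlerProcess(get_project_settings())  # picks up the project's settings.py
process.crawl(MySpider)  # pass the imported spider class, or its name as a string
process.start()          # blocks until the crawl finishes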

2) I have configured Scrapy pipelines to export to S3, but luigi also seems to do this with output(). Which should I use and how do I get them to play together?
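For context, a minimal sketch of the Scrapy side of that export, using the built-in feed-export settings rather than a custom pipeline (the bucket path and credentials here are placeholders):

# settings.py -- illustrative values only
FEED_FORMAT = 'csv'
FEED_URI = 's3://my-bucket/mass/%(name)s_%(time)s.csv'  # %(name)s and %(time)s are filled in by Scrapy
AWS_ACCESS_KEY_ID = 'YOUR_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_KEY'

Whichever side ends up writing the file, the Scrapy feed path and the luigi output() target need to point at the same key, so luigi's completeness check reflects what the crawl actually produced.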

3) Luigi says that CrawlTask() ran successfully, but that is wrong because it completes in seconds and the crawls usually take a few minutes. There is also no output file corresponding to success.
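One likely reason: if the scrapy command fails (for example because it is not run from the project directory), os.system() ignores the non-zero exit code, and nothing ever writes the file that output() declares. A sketch of a stricter variant, reusing the luigi and datetime imports from the code below (an assumption, not a fix verified against the poster's project):

import subprocess

class CrawlTask(luigi.Task):  # variant of the CrawlTask defined below
    crawltime = datetime.now()
    spider = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("actual_data_staging/crawl_luigi_test_{}.csv".format(self.crawltime))

    def run(self):
        # check_call raises CalledProcessError if "scrapy crawl" exits non-zero,
        # so luigi marks the task as failed instead of reporting success; "-o"
        # sends Scrapy's feed export to the same file that output() declares.
        subprocess.check_call(["scrapy", "crawl", self.spider, "-o", self.output().path])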

4) Where do I supply the credentials for S3?
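For reference, luigi's S3 client can take its credentials from the [s3] section of the luigi config file, and otherwise falls back to boto's usual environment variables or credential files; a sketch with placeholder values:

# luigi.cfg (client.cfg in older luigi versions) -- placeholder credentials
[s3]
aws_access_key_id=YOUR_KEY_ID
aws_secret_access_key=YOUR_SECRET_KEY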

Here is my code. I have commented out things that weren't working in favor of what I perceive to be better. But my sense is that there is a grand architecture to what I want to do that I just don't understand yet.

import luigi
from luigi.s3 import S3Target, S3Client
import my_matching
from datetime import datetime
import os
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from my_crawlers.my_crawlers.spiders import my_spider

class CrawlTask(luigi.Task):
    crawltime = datetime.now()
    spider = luigi.Parameter()
    #vertical = luigi.Parameter()

    def requires(self):
        pass

    def output(self):
        return luigi.LocalTarget("actual_data_staging/crawl_luigi_test_{}.csv".format(self.crawltime))
        #return S3Target("s3://my-bucket/mass/crawl_luigi_test_{}.csv".format(self.crawltime))

    def run(self):
        os.system("scrapy crawl %s" % self.spider)
        #process = CrawlerProcess(get_project_settings())
        #process.crawl("%s" % self.spider)
        #process.start()

class FetchPC(luigi.Task):
    vertical = luigi.Parameter()

    def output(self):
        # output() must return a Target, not a bare path string
        if self.vertical == "product1":
            return luigi.LocalTarget("actual_data_staging/product1_catalog.csv")
        elif self.vertical == "product2":
            return luigi.LocalTarget("actual_data_staging/product2_catalog.csv")

class MatchTask(luigi.Task):
    crawltime = CrawlTask.crawltime
    vertical = luigi.Parameter()
    spider = luigi.Parameter()

    def requires(self):
        # both dependencies are needed; requires() can return them as a list
        return [CrawlTask(spider=self.spider),
                FetchPC(vertical=self.vertical)]

    def output(self):
        return luigi.LocalTarget("actual_data_staging/crawl_luigi_test_matched_{}.csv".format(self.crawltime))
        #return S3Target("s3://my-bucket/mass/crawl_luigi_test_matched_{}.csv".format(CrawlTask.crawltime))

    def run(self):
        if self.vertical == 'product1':
            # self.input() mirrors requires(): [crawl output, catalog output]
            crawl_target, catalog_target = self.input()
            switch_board(crawl_target.path, catalog_target.path)
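For completeness, a task defined like this would typically be started from the command line with its parameters supplied at runtime; a sketch, assuming the module above is saved as my_pipeline.py (the module and parameter values are placeholders):

python -m luigi --module my_pipeline MatchTask --spider my_spider --vertical product1 --local-scheduler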

The MatchTask refers to a python script I wrote that compares the scraped products to my product catalog. It looks like this:

def create_search(value):
...
def clean_column(column):
...
def color_false_positive():
...
def switch_board(scrape, product_catalog):
# this function coordinates the whole script

Below is a very rough outline of how it could look. I think the main difference from your outline, with regard to luigi working as a pull system, is that you specify the output you want first, which then triggers the other tasks upon which that output depends. So, rather than naming things with the time the crawl ends, it is easier to name things after something you know at the start. It is possible to do it the other way, just with a lot of unnecessary complication.

class CrawlTask(luigi.Task):
    crawltime = luigi.DateParameter()

    def requires(self):
        pass

    def get_filename(self):
        return "s3://my-bucket/crawl_{}.csv".format(self.crawltime)

    def output(self):
        return S3Target(self.get_filename())

    def run(self):
        perform_crawl(s3_filename=self.get_filename())


class CleanTask(luigi.Task):
    crawltime = luigi.DateParameter()

    def requires(self):
        return CrawlTask(crawltime=self.crawltime)

    def get_filename(self):
        return "s3://my-bucket/clean_crawl_{}.csv".format(self.crawltime)

    def output(self):
        return S3Target(self.get_filename())

    def run(self):
        perform_clean(input_file=self.input().path, output_filename=self.get_filename())


class MatchTask(luigi.Task):
    crawltime = luigi.DateParameter()

    def requires(self):
        return CleanTask(crawltime=self.crawltime)

    def output(self):
        return ##?? whatever output of this task is

    def run(self):
        perform_match(input_file=self.input().path)

What you could do is create a larger system that encapsulates both your crawls and processing. This way you don't have to check s3 for new objects. I haven't used luigi before, but maybe you can turn your scrapy job into a task, and when it's done do your processing task. Anyway, I don't think 'checking' s3 for new stuff is a good idea because 1. you will have to make lots of API calls, and 2. you will need to write a bunch of code to check whether something is 'new' or not, which could get hairy.
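A hedged sketch of that idea, building on the CrawlTask/CleanTask/MatchTask chain sketched above: a luigi.WrapperTask (the name CrawlPipeline is hypothetical) that produces no output of its own and only requires the end of the chain, so scheduling it for a given crawltime pulls the crawl, clean, and match steps along with it.

class CrawlPipeline(luigi.WrapperTask):
    # Hypothetical wrapper: requiring MatchTask drags in CleanTask and
    # CrawlTask as well, so one invocation runs the whole chain.
    crawltime = luigi.DateParameter()

    def requires(self):
        return MatchTask(crawltime=self.crawltime)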
