
Re-using generic tasks in Luigi

I'm having trouble understanding how to make re-usable tasks in Luigi, and then use them in a concrete situation.

For example, I have two generic tasks that each do something to a file and then output the result:

class GffFilter(luigi.Task):
    "Filters a GFF file to only one feature"
    feature = luigi.Parameter()
    out_file = luigi.Parameter()
    in_file = luigi.Parameter()
    ...

class BgZip(luigi.Task):
    "bgZips a file"
    out_file = luigi.Parameter()
    in_file = luigi.Parameter()
    ...

Now, I want a workflow that first filters, then bgzips a specific file using these tasks:

class FilterSomeFile(luigi.WrapperTask):
    def requires(self):
        return GffFilter(in_file='some.gff3', out_file='some.genes.gff3', feature='gene')

    def output(self):
        return self.input()

class BgZipSomeFile(luigi.Task):
    def run(self):
        filtered = FilterSomeFile()
        BgZip(filtered)

But this is awkward. In the first task I have no run method, and I'm just using dependencies to use the generic task. Is this correct? Should I be using inheritance here instead?

Then in the second task, I can't use dependencies, because I need the output from FilterSomeFile in order to use BgZip. But using dynamic dependencies seems wrong, because Luigi can't build a proper dependency graph.

How should I make a Luigi workflow out of my generic tasks?

"But this is awkward. In the first task I have no run method, and I'm just using dependencies to use the generic task. Is this correct?"

Yes. A WrapperTask is a dummy task whose purpose is to define a workflow of tasks, so it doesn't perform any actions by itself. Instead, it counts as complete once every requirement listed in its requires method has been completed. The main difference between a WrapperTask and a regular Task is that you don't need to define an output method to signal that the task succeeded.

"Then in the second task, I can't use dependencies, because I need the output from FilterSomeFile in order to use BgZip. But using dynamic dependencies seems wrong, because luigi can't build a proper dependency graph."

Theoretically, you could give FilterSomeFile the same output as GffFilter, make BgZipSomeFile require FilterSomeFile, and then use FilterSomeFile.output() in BgZipSomeFile.run to access the filtered file that needs to be zipped. However, this solution would be somewhat strange because:

  • The wrapper task only "runs" one other task, so the wrapped task could be used directly, without having to create a WrapperTask. A better use of WrapperTask would be to merge BgZipSomeFile and FilterSomeFile into a single subclass of WrapperTask.

  • A Task is being instantiated in a run method. This results in a dynamic dependency, which is not needed for this problem.

  • Finally, the input of GffFilter is hardcoded in the FilterSomeFile task, which makes the workflow less reusable. This could be avoided by having the WrapperTask still receive parameters and pass them on to its requirements.

A better solution would be:

import luigi as lg

class A(lg.Task):
    inFile = lg.Parameter()
    outFile = lg.Parameter()

    def run(self):
        # Read the input file, append a marker block, and write the result.
        with open(self.inFile, "r") as oldFile:
            text = oldFile.read()

        text += "*" * 10 + "\n" + "This text was added by task A.\n" + "*" * 10 + "\n"
        print(text)
        with open(self.outFile, "w") as newFile:
            newFile.write(text)

    def output(self):
        return lg.LocalTarget(self.outFile)

class B(lg.Task):
    inFile = lg.Parameter()
    outFile = lg.Parameter()

    def run(self):
        # Same as task A, but with task B's marker.
        with open(self.inFile, "r") as oldFile:
            text = oldFile.read()

        text += "*" * 10 + "\n" + "This text was added by task B.\n" + "*" * 10 + "\n"

        with open(self.outFile, "w") as newFile:
            newFile.write(text)

    def output(self):
        return lg.LocalTarget(self.outFile)

class CustomWorkflow(lg.WrapperTask):
    mainOutFile = lg.Parameter()
    mainInFile = lg.Parameter()
    tempFile = "/tmp/myTempFile.txt"  # intermediate file written by A and read by B

    def requires(self):
        return [
            A(inFile=self.mainInFile, outFile=self.tempFile),
            B(inFile=self.tempFile, outFile=self.mainOutFile),
        ]

This code can be run from the command line with:

PYTHONPATH='.' luigi --module pythonModuleContainingTheTasks --local-scheduler CustomWorkflow --mainInFile ./text.txt --mainOutFile ./procText.txt
