
Can luigi rerun tasks when the task dependencies become out of date?

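As far as I know, a luigi.Target can either exist or not. Therefore, if a luigi.Target exists, it won't be recomputed.

I'm looking for a way to force a task to be recomputed if one of its dependencies is modified, or if the code of one of the tasks changes.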

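One way to accomplish this is to override the complete(...) method.

The documentation for complete is straightforward.

Simply implement a function that checks your constraint and returns False if you want the task to be recomputed.

For example, to force recomputation when a dependency has been updated, you could do: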

def complete(self):
    """Flag this task as incomplete if any requirement is incomplete or has been updated more recently than this task"""
    import os
    import time

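    def mtime(path):
        # NOTE: time.ctime returns a string such as "Sat Mar 28 ...", so the
        # comparisons below are lexicographic; a later answer in this thread
        # returns the float from os.path.getmtime(path) instead.
        return time.ctime(os.path.getmtime(path))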

    # assuming 1 output
    if not os.path.exists(self.output().path):
        return False

    self_mtime = mtime(self.output().path) 

    # the below assumes a list of requirements, each with a list of outputs. YMMV
    for el in self.requires():
        if not el.complete():
            return False
        for output in el.output():
            if mtime(output.path) > self_mtime:
                return False

    return True

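This returns False when any requirement is incomplete, when any requirement has been modified more recently than the current task, or when the output of the current task does not exist.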
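
The same staleness rule can be tried outside Luigi. The sketch below is only an illustration: `is_stale` is a hypothetical helper, not part of Luigi, and it takes plain file paths rather than targets. It compares the float timestamps from os.path.getmtime directly.

```python
import os


def is_stale(output_paths, input_paths):
    """Return True if any output is missing or older than any input.

    Hypothetical helper for illustration; it is not part of Luigi.
    """
    # No outputs, or a missing output, always means the work must be redone.
    if not output_paths or not all(os.path.exists(p) for p in output_paths):
        return True

    # The oldest output decides: no input may be newer than it.
    oldest_output = min(os.path.getmtime(p) for p in output_paths)
    return any(os.path.getmtime(p) > oldest_output for p in input_paths)
```

A complete() override would return `not is_stale(...)` after also checking that every requirement is itself complete.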

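Detecting when code has changed is harder. You could use a similar scheme (checking the mtime of the source file), but it would be hit-or-miss unless every task has its own file.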
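
One possible variation, offered only as a sketch, is to compare a content hash of the source file instead of its mtime, so that merely touching the file does not trigger a rerun. The names `file_digest`, `code_changed` and `record_digest`, and the idea of a sidecar "stamp" file, are assumptions made for this example and are not Luigi features. It has the same per-file granularity limit as the mtime scheme.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def code_changed(source_path, stamp_path):
    """Return True if source_path differs from the digest saved in stamp_path.

    Read-only check, so it is safe to call repeatedly (e.g. from complete()).
    """
    stamp = Path(stamp_path)
    if not stamp.exists():
        return True
    return stamp.read_text().strip() != file_digest(source_path)


def record_digest(source_path, stamp_path):
    """Save the current digest; meant to be called after a successful run."""
    Path(stamp_path).write_text(file_digest(source_path))
```

Keeping the check free of side effects matters because complete() can be called more than once per scheduling pass.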

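Because complete can be overridden, any logic you want for recomputation can be implemented. If you want the same complete method for many tasks, I'd recommend subclassing luigi.Task, implementing your custom complete there, and then inheriting your tasks from that subclass.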

I'm late to the game, but here's a mixin that improves the accepted answer to support multiple input / output files.

import os
import time

class MTimeMixin:
    """
        Mixin that flags a task as incomplete if any requirement
        is incomplete or has been updated more recently than this task
        This is based on http://stackoverflow.com/a/29304506, but extends
        it to support multiple input / output dependencies.
    """

    def complete(self):
        def to_list(obj):
            if type(obj) in (type(()), type([])):
                return obj
            else:
                return [obj]

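        def mtime(path):
            # NOTE: time.ctime returns a string, so min() and ">" below compare
            # text, not time; a later answer in this thread returns the float
            # from os.path.getmtime(path) instead.
            return time.ctime(os.path.getmtime(path))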

        if not all(os.path.exists(out.path) for out in to_list(self.output())):
            return False

        self_mtime = min(mtime(out.path) for out in to_list(self.output()))

        # the below assumes a list of requirements, each with a list of outputs. YMMV
        for el in to_list(self.requires()):
            if not el.complete():
                return False
            for output in to_list(el.output()):
                if mtime(output.path) > self_mtime:
                    return False

        return True

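To use it, declare your class as class MyTask(MTimeMixin, luigi.Task). The mixin has to come before luigi.Task so that its complete takes precedence.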

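The code above works well for me, except that I believe a correct timestamp comparison needs mtime(path) to return a float rather than a string ("Sat" > "Mon" ... [sic]). So simply use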

def mtime(path):
    return os.path.getmtime(path)

instead of:

def mtime(path):
    return time.ctime(os.path.getmtime(path))
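
To see why the string form misorders timestamps, here is a small demonstration. It uses time.asctime(time.gmtime(ts)), which produces the same layout as time.ctime but pinned to UTC so the result does not depend on the local time zone; the two timestamps are arbitrary values chosen for the example.

```python
import time

# Two timestamps two days apart (seconds since the epoch, UTC).
earlier = 2 * 86400  # 1970-01-03, a Saturday
later = 4 * 86400    # 1970-01-05, the following Monday


def as_text(ts):
    # Same layout as time.ctime, but in UTC so the demo is reproducible.
    return time.asctime(time.gmtime(ts))


# Numeric comparison orders the timestamps correctly.
floats_ok = earlier < later

# Lexicographic comparison gets it backwards: "Sat ..." sorts after "Mon ...".
strings_wrong = as_text(earlier) > as_text(later)

print(floats_ok, strings_wrong)
```

Both flags come out True: the floats are ordered correctly, while the strings claim the Saturday is "newer" than the Monday after it.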

Regarding Shilad Sen's Mixin suggestion, consider the following example:

# Filename: run_luigi.py
import luigi
from MTimeMixin import MTimeMixin

class PrintNumbers(luigi.Task):

    def requires(self):
        return []

    def output(self):
        return luigi.LocalTarget("numbers_up_to_10.txt")

    def run(self):
        with self.output().open('w') as f:
            for i in range(1, 11):
                f.write("{}\n".format(i))

class SquaredNumbers(MTimeMixin, luigi.Task):

    def requires(self):
        return [PrintNumbers()]

    def output(self):
        return luigi.LocalTarget("squares.txt")

    def run(self):
        with self.input()[0].open() as fin, self.output().open('w') as fout:
            for line in fin:
                n = int(line.strip())
                out = n * n
                fout.write("{}:{}\n".format(n, out))

if __name__ == '__main__':
    luigi.run()

where MTimeMixin is the same as in the post above. I run the task once using

luigi --module run_luigi SquaredNumbers

Then I touch the file numbers_up_to_10.txt and run the task again. Luigi then fails with the following error:

  File "c:\winpython-64bit-3.4.4.6qt5\python-3.4.4.amd64\lib\site-packages\luigi-2.7.1-py3.4.egg\luigi\local_target.py", line 40, in move_to_final_destination
    os.rename(self.tmp_path, self.path)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'squares.txt-luigi-tmp-5391104487' -> 'squares.txt'

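This may be a Windows-only problem and not an issue on Linux, where "mv a b" simply replaces the old b if it already exists and is not write-protected. We can fix it with the following patch to luigi/local_target.py, which renames the existing output to a timestamped backup first: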

def move_to_final_destination(self):
    if os.path.exists(self.path):
        os.rename(self.path, self.path + time.strftime("_%Y%m%d%H%M%S.txt"))
    os.rename(self.tmp_path, self.path)
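
If no backup of the old output is needed, a possible alternative is os.replace (Python 3.3+), which overwrites an existing destination on Windows as well as on POSIX. The sketch below is a standalone, hypothetical helper named `move_into_place`; it is not Luigi's implementation and is shown only to illustrate the call.

```python
import os


def move_into_place(tmp_path, final_path):
    """Move tmp_path over final_path, overwriting final_path if it exists.

    Hypothetical helper for illustration; not part of Luigi.
    """
    # os.replace overwrites an existing destination on both POSIX and Windows,
    # whereas os.rename raises FileExistsError on Windows in that case.
    os.replace(tmp_path, final_path)
```

Unlike the patch above, this discards the previous output instead of keeping a timestamped copy.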

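Also, for completeness, here is the Mixin again as a separate file (MTimeMixin.py, to match the import above), taken from the other post and with the float mtime: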

import os

class MTimeMixin:
    """
        Mixin that flags a task as incomplete if any requirement
        is incomplete or has been updated more recently than this task
        This is based on http://stackoverflow.com/a/29304506, but extends
        it to support multiple input / output dependencies.
    """

    def complete(self):
        def to_list(obj):
            if type(obj) in (type(()), type([])):
                return obj
            else:
                return [obj]

        def mtime(path):
            return os.path.getmtime(path)

        if not all(os.path.exists(out.path) for out in to_list(self.output())):
            return False

        self_mtime = min(mtime(out.path) for out in to_list(self.output()))

        # the below assumes a list of requirements, each with a list of outputs. YMMV
        for el in to_list(self.requires()):
            if not el.complete():
                return False
            for output in to_list(el.output()):
                if mtime(output.path) > self_mtime:
                    return False

        return True
