Python generator to list: exception re-raised

I have a simple poller class (code snippet below) that retrieves files from a number of folders based on a regex. I try to catch OSError exceptions and ignore them, since files can be moved, deleted, or have their permissions changed at any time. During some testing (in which I created and deleted a large number of files) I noticed that when sorting the generator's output, the exceptions that were raised in the generator function (_get) were re-raised(?), and I had to use an additional try/except block to get around this.

Any idea why this is happening? All comments/improvements appreciated!

Thanks, Timmah

def __init__(self, **kwargs):
    self._sortkey = kwargs.get('sortkey', os.path.getmtime)

def _get(self, maxitems=0):
    def customfilter(f):
        if self._exclude is not None and self._exclude.search(f): return False
        if self._regex is not None:
            return self._regex.search(f)

        return True

    count = 0
    for p in self.paths:
        if not os.path.isdir(p): raise PollException("'%s' is not a valid path." % (p), p)
        if maxitems and count >= maxitems: break
        try:
            for f in [os.path.join(p, f) for f in filter(customfilter, os.listdir(p))]:
                if maxitems and count >= maxitems: break

                if not self._validate(f): continue

                count += 1
                yield f
        except OSError:
            '''
            There will be instances where we wont have permission on the file/directory or
            when a file is moved/deleted before it was yielded.
            '''
            continue

def get(self, maxitems=0):
    try:
        if self._sortkey is not None:
            files = sorted(self._get(maxitems), key=self._sortkey, reverse=self._sortreverse)
        else:
            files = self._get(maxitems)
    except OSError:
        '''
        self._sortkey uses os.path function to sort so exceptions can happen again
        '''
        return

    for f in files:
        yield f

if __name__ == '__main__':
    while True:
        # Raw string keeps the regex escape intact; print(f) works on Python 2 and 3
        for f in poll(paths=['/tmp'], regex=r"^.*\.CSV").get(10):
            print(f)

EDIT: Thanks to @ShadowRanger for pointing out the os.path function that was passed as the sortkey param.

Posting an answer for posterity: per psychic intuition (and confirmation in the comments), self._sortkey was trying to stat the files being sorted. Having read permission on a directory is sufficient to get the filenames contained within it, but it does not guarantee that you can stat those files: on POSIX systems stat also requires search (execute) permission on the directory, and a file can be moved or deleted between the listing and the stat call.

Since sorted executes the key function outside the generator's scope, nothing in the generator is raising the exception, and therefore the generator can't catch it. You'd need to pre-filter/pre-compute the stat values for each file (and drop files that can't be stat-ed), sort on that, then drop the (no longer relevant) stat data. For example:

from operator import itemgetter

def with_key(filenames, key):
    '''Generates computed_key, filename pairs

    Silently filters out files where the key function raises OSError
    '''
    for f in filenames:
        try:
            yield key(f), f
        except OSError:
            pass

# ... skipping to the `sorted` call in get ...
# Replace the existing sorted call with:
# map(itemgetter(1), ...) strips the key, yielding only the file name
files = map(itemgetter(1),
            sorted(
                   # Use with_key to filter and decorate filenames with sortkey
                   with_key(self._get(maxitems), self._sortkey),
                   # Use key=itemgetter(0) so only sortkey is considered for
                   # sorting (making sort stable, instead of performing fallback
                   # comparison between filenames when key is the same)
                   key=itemgetter(0), reverse=self._sortreverse))

It's basically performing the Schwartzian Transform (aka "Decorate-Sort-Undecorate") manually. Normally, Python's key argument for sorted / list.sort hides this complexity from you, but in this case (thanks to the possibility of exceptions, the need to drop the item if one occurs, and the desire to minimize race conditions by using EAFP patterns) you have to do the work yourself.
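To see the failure mode and the fix in isolation, here is a minimal sketch, assuming Python 3 and a throwaway temporary directory. The names guarded and demo are hypothetical and exist only for illustration; with_key mirrors the helper shown above.

```python
import os
import tempfile
from operator import itemgetter


def guarded(paths):
    """Hypothetical stand-in for _get: a generator with its own try/except."""
    try:
        for p in paths:
            yield p
    except OSError:
        # Never reached in this demo: the generator itself does not raise.
        return


def with_key(filenames, key):
    """Yield (key, filename) pairs, skipping files whose key raises OSError."""
    for f in filenames:
        try:
            yield key(f), f
        except OSError:
            pass


def demo():
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for name in ("kept.csv", "gone.csv"):
            path = os.path.join(d, name)
            open(path, "w").close()
            paths.append(path)

        # Simulate a file vanishing after it was listed but before sorting.
        os.remove(paths[1])

        # sorted() calls the key function itself, outside the generator,
        # so the generator's try/except cannot intercept the error.
        try:
            sorted(guarded(paths), key=os.path.getmtime)
            naive_failed = False
        except OSError:
            naive_failed = True

        # Decorate (and filter), sort on the precomputed key, undecorate.
        decorated = sorted(with_key(guarded(paths), os.path.getmtime),
                           key=itemgetter(0))
        survivors = [os.path.basename(f) for _, f in decorated]
        return naive_failed, survivors
```

Calling demo() should show that the plain sorted call fails with an OSError subclass, while the decorated version quietly drops the missing file and keeps the rest.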

Alternate solution with Python 3.5 (or 2.6-2.7 and 3.2-3.4 using the third-party scandir package):

If you so desired, you could avoid this issue with far less complexity and likely better performance (and on Windows, or on a Windows-like file system that caches file metadata in the directory entry, even include unreadable files in your output so long as the directory was readable). os.scandir (or pre-3.5, scandir.scandir) on Windows gets you the stat information cached in the directory entry "for free" (you only pay the RTT cost once per few thousand entries in a directory, not once per file), and on Linux the first call to DirEntry.stat caches the stat data. Making that call in _get means you can catch and handle OSError there and populate the cache, so that during sorting self._sortkey can use the cached data with no risk of OSError. So you could do:

try:
    from os import scandir
except ImportError:
    from scandir import scandir

# Prestat will ensure OSErrors raised in _get, not in caller using DirEntry
def _get(self, maxitems=0, prestat=True, follow_symlinks=True):
    def customfilter(f):
        if self._exclude is not None and self._exclude.search(f):
            return False
        return self._regex is None or self._regex.search(f)

    count = 0
    for p in self.paths:
        if not os.path.isdir(p): raise PollException("'%s' is not a valid path." % (p,), p)
        if maxitems and count >= maxitems: break
        try:
            # Use scandir over listdir, and since we get DirEntrys, we
            # don't need to explicitly use os.path.join to make full paths
            # and we can use genexpr for validation instead
            for dirent in (de for de in scandir(p) if customfilter(de.name) and self._validate(de.path)):
                # On Windows, stat() is cheap noop (returns precomputed data)
                # except symlink w/follow_symlinks=True (where it stats and caches)
                # On Linux, this will force a stat now, and cache the result
                # so OSErrors will only be raised here, not during sorting
                if prestat:
                    dirent.stat(follow_symlinks=follow_symlinks)

                if maxitems and count >= maxitems: break

                count += 1
                yield dirent
        except OSError:
            '''
            There will be instances where we wont have permission on the file/directory or
            when a file is moved/deleted before it was yielded.
            '''
            continue

def get(self, maxitems=0):
    # Prestat if we have a sortkey (assuming it may use stat data)
    files = self._get(maxitems, prestat=self._sortkey is not None)
    if self._sortkey is not None:
        # self._sortkey must now operate on a os.DirEntry
        # but no more need to wrap in try/except OSError
        files = sorted(files, key=self._sortkey, reverse=self._sortreverse)

    # To preserve observable public behaviors, return path, not DirEntry
    for dirent in files:
        yield dirent.path

This requires a small change in usage: self._sortkey must operate on an os.DirEntry instance, not a file path. So instead of self._sortkey = kwargs.get('sortkey', os.path.getmtime), you might have self._sortkey = kwargs.get('sortkey', lambda de: de.stat().st_mtime).

But it avoids the complexity of a manual Schwartzian Transform (because access violations can only occur inside _get's try/except as long as you don't change prestat, so no OSErrors occur during key computation). It will also likely run faster, by lazily iterating the directory instead of constructing a complete list before iterating (admittedly a small benefit unless the directory is huge) and by removing the need to use a stat system call at all for most directory entries on Windows.
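The caching behavior this relies on can be checked with a small standalone sketch, assuming Python 3.6+ (for using os.scandir as a context manager). The names prestat_entries and demo are hypothetical, and the fixed timestamps set with os.utime are only there to make the ordering predictable.

```python
import os
import tempfile


def prestat_entries(directory):
    """Yield DirEntry objects whose stat result is already cached.

    Entries that cannot be stat-ed are dropped here, so any OSError
    is handled before the caller ever sorts.
    """
    with os.scandir(directory) as it:
        for de in it:
            try:
                de.stat()  # forces (and caches) the stat where needed
            except OSError:
                continue
            yield de


def demo():
    with tempfile.TemporaryDirectory() as d:
        for i, name in enumerate(("a.csv", "b.csv")):
            path = os.path.join(d, name)
            open(path, "w").close()
            # Give each file a distinct, known modification time.
            os.utime(path, (1000 + i, 1000 + i))

        entries = list(prestat_entries(d))

        # Remove a file after it was listed and stat-ed.
        os.remove(os.path.join(d, "a.csv"))

        # The key reads the cached stat result, so the missing file
        # does not trigger a new system call or an OSError.
        ordered = sorted(entries, key=lambda de: de.stat().st_mtime,
                         reverse=True)
        return [de.name for de in ordered]
```

Note the trade-off: the sort can no longer fail, but the result may include entries for files that no longer exist, so whoever consumes the paths still has to tolerate a missing file when opening it.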
