python generator to list exception re-raised
I have a simple poller class (code snippet below) which retrieves files from a number of folders based on a regex. I attempt to catch OSError exceptions and ignore them, as files could be moved out/deleted, permissions could change, etc. During some testing (in which I created/deleted a large number of files) I noticed that when sorting the generator, the exceptions that were raised in the generator function (_get) were re-raised(?), and I had to use an additional try/except block to get around this.
Any idea why this is happening? All comments/improvements appreciated!
Thanks, Timmah
def __init__(self, **kwargs):
    self._sortkey = kwargs.get('sortkey', os.path.getmtime)

def _get(self, maxitems=0):
    def customfilter(f):
        if self._exclude is not None and self._exclude.search(f): return False
        if self._regex is not None:
            return self._regex.search(f)
        return True
    count = 0
    for p in self.paths:
        if not os.path.isdir(p): raise PollException("'%s' is not a valid path." % (p), p)
        if maxitems and count >= maxitems: break
        try:
            for f in [os.path.join(p, f) for f in filter(customfilter, os.listdir(p))]:
                if maxitems and count >= maxitems: break
                if not self._validate(f): continue
                count += 1
                yield f
        except OSError:
            '''
            There will be instances where we wont have permission on the file/directory or
            when a file is moved/deleted before it was yielded.
            '''
            continue
def get(self, maxitems=0):
    try:
        if self._sortkey is not None:
            files = sorted(self._get(maxitems), key=self._sortkey, reverse=self._sortreverse)
        else:
            files = self._get(maxitems)
    except OSError:
        '''
        self._sortkey uses os.path function to sort so exceptions can happen again
        '''
        return
    for f in files:
        yield f
if __name__ == '__main__':
    while True:
        for f in poll(paths=['/tmp'], regex="^.*\.CSV").get(10):
            print f
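A minimal, standalone repro of the symptom (hypothetical names, not part of the original poller): sorted first drains the generator, then calls the key function on the resulting list, so an OSError raised by the key occurs outside the generator's frame and its internal try/except never fires.

```python
import os

def gen(paths):
    # Mimics _get: the try/except only guards code inside the generator body
    for p in paths:
        try:
            yield p  # nothing raises here, so the except clause never fires
        except OSError:
            continue

try:
    # sorted() consumes the generator first, THEN calls the key function;
    # os.path.getmtime raises OSError outside the generator's try/except
    sorted(gen(["/no/such/file"]), key=os.path.getmtime)
except OSError:
    print("OSError escaped the generator's try/except")
```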
EDIT: Thanks to @ShadowRanger for pointing out the os.path function that was passed as the sortkey param.
Posting an answer for posterity: Per psychic intuition (and confirmation in the comments), self._sortkey was trying to stat the files being sorted. While having read permission on a directory is sufficient to get the filenames contained within it, if you lack read permission on those files, you won't be able to stat them.
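Permission failures are hard to demonstrate portably, but a file deleted between listing and stat-ing hits the same OSError path; a quick sketch (scratch directory and filename are made up):

```python
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "victim.txt")
open(path, "w").close()

names = os.listdir(d)   # reading the directory succeeds
os.remove(path)         # file vanishes before we stat it

try:
    os.stat(path)
except OSError as e:
    # FileNotFoundError (a subclass of OSError) on modern Python
    print("stat failed:", type(e).__name__)

os.rmdir(d)
```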
Since sorted is executing the key function outside the generator scope, nothing in the generator is raising the exception, and therefore it can't catch it. You'd need to pre-filter/pre-compute the stat values for each file (and drop files that can't be stat-ed), sort on that, then drop the (no longer relevant) stat data. For example:
from operator import itemgetter

def with_key(filenames, key):
    '''Generates computed_key, filename pairs

    Silently filters out files where the key function raises OSError
    '''
    for f in filenames:
        try:
            yield key(f), f
        except OSError:
            pass

# ... skipping to the `sorted` call in get ...
# Replace the existing sorted call with:
# map(itemgetter(1), ...) strips the key, yielding only the file name
files = map(itemgetter(1),
            sorted(
                # Use with_key to filter and decorate filenames with sortkey
                with_key(self._get(maxitems), self._sortkey),
                # Use key=itemgetter(0) so only sortkey is considered for
                # sorting (making sort stable, instead of performing fallback
                # comparison between filenames when key is the same)
                key=itemgetter(0), reverse=self._sortreverse))
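To see the decorate-sort-undecorate flow in isolation, here is a self-contained sketch with a made-up key function (risky_len stands in for a stat-based key that can raise OSError):

```python
from operator import itemgetter

def with_key(items, key):
    # Decorate: pair each item with its computed key, dropping failures
    for it in items:
        try:
            yield key(it), it
        except OSError:
            pass

def risky_len(name):
    # Stand-in for a stat-based key: fails for one entry
    if name == "bad":
        raise OSError("cannot stat")
    return len(name)

names = ["ccc", "bad", "a", "bb"]
# Sort on the precomputed key, then strip it (undecorate)
result = list(map(itemgetter(1),
                  sorted(with_key(names, risky_len), key=itemgetter(0))))
print(result)  # ['a', 'bb', 'ccc'] -- "bad" was silently dropped
```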
It's basically performing the Schwartzian Transform (aka "Decorate-Sort-Undecorate") manually. Normally, Python's key argument for sorted / list.sort hides this complexity from you, but in this case (thanks to the possibility of exceptions, the need to drop the item if one occurs, and the desire to minimize race conditions by using EAFP patterns), you have to do the work yourself.
Alternate solution (requires Python 3.5+, or 2.6-2.7 and 3.2-3.4 with the scandir package): You could avoid this issue (and on Windows, include unreadable files in your output, so long as the directory was readable and on a Windows-like file system that caches file metadata in the directory entry) if you so desired, with far less complexity and likely better performance.
os.scandir (or pre-3.5, scandir.scandir) on Windows gets you the stat information cached in the directory entry "for free" (you only pay the RTT cost once per few thousand entries in a directory, not once per file), and on Linux the first call to DirEntry.stat caches the stat data. Doing it in _get means you can catch and handle OSError there, populating the cache, so during sorting self._sortkey can use the cached data with no risk of OSError. So you could do:
try:
    from os import scandir
except ImportError:
    from scandir import scandir

# Prestat will ensure OSErrors raised in _get, not in caller using DirEntry
def _get(self, maxitems=0, prestat=True, follow_symlinks=True):
    def customfilter(f):
        if self._exclude is not None and self._exclude.search(f):
            return False
        return self._regex is None or self._regex.search(f)
    count = 0
    for p in self.paths:
        if not os.path.isdir(p): raise PollException("'%s' is not a valid path." % (p,), p)
        if maxitems and count >= maxitems: break
        try:
            # Use scandir over listdir, and since we get DirEntrys, we
            # don't need to explicitly use os.path.join to make full paths
            # and we can use genexpr for validation instead
            for dirent in (de for de in scandir(p) if customfilter(de.name) and self._validate(de.path)):
                # On Windows, stat() is cheap noop (returns precomputed data)
                # except symlink w/follow_symlinks=True (where it stats and caches)
                # On Linux, this will force a stat now, and cache the result
                # so OSErrors will only be raised here, not during sorting
                if prestat:
                    dirent.stat(follow_symlinks=follow_symlinks)
                if maxitems and count >= maxitems: break
                count += 1
                yield dirent
        except OSError:
            '''
            There will be instances where we wont have permission on the file/directory or
            when a file is moved/deleted before it was yielded.
            '''
            continue
def get(self, maxitems=0):
    # Prestat if we have a sortkey (assuming it may use stat data)
    files = self._get(maxitems, prestat=self._sortkey is not None)
    if self._sortkey is not None:
        # self._sortkey must now operate on an os.DirEntry
        # but no more need to wrap in try/except OSError
        files = sorted(files, key=self._sortkey, reverse=self._sortreverse)
    # To preserve observable public behaviors, return path, not DirEntry
    for dirent in files:
        yield dirent.path
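A standalone sketch of the prestat idea (scratch directory and filenames are made up, not the original class): force DirEntry.stat() while iterating, so any OSError surfaces inside the loop where it can be caught, then sort on the cached data.

```python
import os
import tempfile

d = tempfile.mkdtemp()
for name in ("b.txt", "a.txt"):
    open(os.path.join(d, name), "w").close()

entries = []
for de in os.scandir(d):
    try:
        de.stat()      # prestat: any OSError is raised (and caught) here
    except OSError:
        continue       # skip entries we can no longer stat
    entries.append(de)

# The sort key reuses the cached stat result; no OSError risk here
ordered = sorted(entries, key=lambda de: de.stat().st_mtime)
paths = [de.path for de in ordered]

# clean up the scratch directory
for p in paths:
    os.remove(p)
os.rmdir(d)
```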
This requires a small change in usage; self._sortkey must operate on an os.DirEntry instance, not a file path. So instead of self._sortkey = kwargs.get('sortkey', os.path.getmtime), you might have self._sortkey = kwargs.get('sortkey', lambda de: de.stat().st_mtime).
But it avoids the complexity of manual Schwartzian Transforms (because access violations can only occur in _get's try/except as long as you don't change prestat, so no OSErrors occur during key computation). It will also likely run faster, by lazily iterating the directory instead of constructing a complete list before iterating (admittedly a small benefit unless the directory is huge), and by removing the need to use a stat system call at all for most directory entries on Windows.