Negative lookahead assertion not working in Python

Task:
- given: a list of image filenames
- todo: create a new list with filenames not containing the word "thumb", i.e. only target the non-thumbnail images (with PIL, the Python Imaging Library).

I've tried r".*(?!thumb).*" but it failed.

I've found the solution (here on Stack Overflow): prepend a ^ to the regex and put the .* into the negative lookahead, giving r"^(?!.*thumb).*", and this now works.

The thing is, I would like to understand why my first solution did not work, but I don't. Since regexes are complicated enough, I would really like to understand them.

What I do understand is that the ^ tells the parser that the following condition is to match at the beginning of the string. But doesn't the .* in the (not working) first example also start at the beginning of the string? I thought it would start at the beginning of the string and search through as many characters as it can before reaching "thumb". If so, it would return a non-match.

Could someone please explain why r".*(?!thumb).*" does not work but r"^(?!.*thumb).*" does?

Thanks!

Could someone please explain why r".*(?!thumb).*" does not work but r"^(?!.*thumb).*" does?

The first will always match, because the first .* consumes the whole string; at the end of the string nothing follows, so the negative lookahead cannot fail. The second is a bit convoluted: anchored at the start of the line, the lookahead scans ahead for 'thumb' anywhere in the rest of the line, and if it is present the entire match fails, because the line does begin with something followed by 'thumb'.
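
To see the difference concretely, here is a small sketch that applies both patterns to a list of filenames. The filenames and variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical filenames, made up for illustration.
filenames = ["file1.jpg", "file1-thumb.jpg", "file2.jpg", "thumb-file2.jpg"]

broken = re.compile(r".*(?!thumb).*")
working = re.compile(r"^(?!.*thumb).*")

# The first pattern matches every filename: the greedy .* runs to the end of
# the string, where nothing follows, so the negative lookahead succeeds.
matched_by_broken = [f for f in filenames if broken.match(f)]

# The second pattern checks, from position 0, that "thumb" occurs nowhere
# further along in the string.
matched_by_working = [f for f in filenames if working.match(f)]

print(matched_by_broken)
print(matched_by_working)
```

The first list should contain all four names, while the second should contain only the two names without "thumb".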

Number two is more easily written as:

  • 'thumb' not in string
  • not re.search('thumb', string) (note: search, not match)

Also, as I mentioned in the comments, your question says:

filenames not containing the word "thumb"

So you may wish to consider whether something like "thumbs up" is supposed to be excluded.
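
If only the whole word should count, one possible approach (an assumption about what "word" means here, not something the question specifies) is a word-boundary pattern. The names below are made up for illustration.

```python
import re

# Hypothetical names, made up for illustration.
names = ["thumb.jpg", "thumbs-up.jpg", "file1-thumb.jpg", "file1.jpg"]

# Substring test: excludes anything containing the letters "thumb".
no_substring = [n for n in names if "thumb" not in n]

# Whole-word test: \b requires a word boundary on both sides of "thumb",
# so "thumbs-up.jpg" is kept.
whole_word = re.compile(r"\bthumb\b")
no_whole_word = [n for n in names if not whole_word.search(n)]

print(no_substring)
print(no_whole_word)
```

Note that an underscore counts as a word character in Python's re module, so a name like "file1_thumb.jpg" would not be caught by \bthumb\b.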

(Darn, Jon beat me. Oh well, you can look at the examples anyway)

Like the other guys have said, regex is not the best tool for this job. If you are working with file paths, take a look at os.path.

As for filtering out files you don't want, you can do if 'thumb' not in filename: ... once you have dissected the path (where filename is a str).
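
A minimal sketch of that idea, assuming the inputs are full paths and that only the final path component should be tested. The paths are made up for illustration.

```python
import os.path

# Hypothetical paths, made up for illustration.
paths = ["/tmp/thumbs/file1.jpg", "/tmp/thumbs/file1-thumb.jpg"]

# Test only the final path component, so a directory named "thumbs"
# does not cause every file inside it to be excluded.
kept = [p for p in paths if "thumb" not in os.path.basename(p)]

print(kept)
```

Testing the whole path instead would drop both entries here, because the directory name itself contains "thumb".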

And for posterity, here are my thoughts on those regexes. r".*(?!thumb).*" does not work because .* is greedy: it consumes the whole string before the lookahead is ever tested. Take a look at this:

>>> re.search('(.*)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
('/tmp/somewhere/thumb', '', '')
>>> re.search('(.*?)((?!thumb))(.*)', '/tmp/somewhere/thumb').groups()
('', '', '/tmp/somewhere/thumb')
>>> re.search('(.*?)((?!thumb))(.*?)', '/tmp/somewhere/thumb').groups()
('', '', '')

The last one looks strange at first, but every quantifier in it is lazy, so each group matches as little as possible, which is nothing.

The other regex (r"^(?!.*thumb).*") works because the .* is inside the lookahead, so you don't have any issues with characters being stolen. You don't even need the ^ if you use re.match, which only matches at the start of the string; with re.search you do:

>>> re.search('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
('', 'humb')
>>> re.search('^((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> re.match('((?!.*thumb))(.*)', '/tmp/somewhere/thumb').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
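
The same three cases can be written without triggering the AttributeError by checking for None first. This is only a sketch reusing the path from the session above; the variable names are made up.

```python
import re

path = "/tmp/somewhere/thumb"

# re.search slides along the string and finds a start position just past the
# "t" of "thumb", where the lookahead no longer sees "thumb" ahead.
unanchored_search = re.search(r"(?!.*thumb).*", path)

# With ^ the search may only start at position 0, where the lookahead fails.
anchored_search = re.search(r"^(?!.*thumb).*", path)

# re.match only tries position 0, so it behaves like the anchored search.
unanchored_match = re.match(r"(?!.*thumb).*", path)

print(unanchored_search.group() if unanchored_search else None)
print(anchored_search)
print(unanchored_match)
```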

Ignoring all the bits about regular expressions, your task seems relatively simple:

  • given: a list of image filenames
  • todo: create a new list with filenames not containing the word "thumb", i.e. only target the non-thumbnail images (with PIL, the Python Imaging Library).

Assuming you have a list of filenames that looks something like this:

filenames = [ 'file1.jpg', 'file1-thumb.jpg', 'file2.jpg', 'file2-thumb.jpg' ]

Then you can get a list of files not containing the word thumb like this:

not_thumb_filenames = [ filename for filename in filenames if 'thumb' not in filename ]

That's what we call a list comprehension, and it is essentially shorthand for:

not_thumb_filenames = []
for filename in filenames:
  if 'thumb' not in filename:
    not_thumb_filenames.append(filename)

Regular expressions aren't really necessary for this simple task.
