Python regex optional number match returns more than expected

I have a list of files, and I am trying to filter for the subset of file names that end in 000000, 060000, 120000, or 180000. I know I could do a straight string match, but I would like to understand why the regular expression I attempted below, r'[00|06|12|18]+0000', does not work (it also returns MSM_20130519210000.csv). I intend it to match any one of 00, 06, 12, 18, followed by 0000. How can that be accomplished? Please keep the answer along the lines of this intended regex rather than other functions, thanks.

Here is the code snippet:

import re

files_in_input_directory = ['MSM_20130519150000.csv', 'MSM_20130519180000.csv', 'MSM_20130519210000.csv', 
'MSM_20130520000000.csv', 'MSM_20130520030000.csv', 'MSM_20130520060000.csv', 'MSM_20130520090000.csv', 
'MSM_20130520120000.csv', 'MSM_20130520150000.csv', 'MSM_20130520180000.csv', 'MSM_20130520210000.csv', 
'MSM_20130521000000.csv', 'MSM_20130521030000.csv', 'MSM_20130521060000.csv', 'MSM_20130521090000.csv', 
'MSM_20130521120000.csv', 'MSM_20130521150000.csv', 'MSM_20130521180000.csv', 'MSM_20130521210000.csv', 
'MSM_20130522000000.csv', 'MSM_20130522030000.csv', 'MSM_20130522060000.csv', 'MSM_20130522090000.csv', 
'MSM_20130522120000.csv', 'MSM_20130522150000.csv', 'MSM_20130522180000.csv', 'MSM_20130522210000.csv', 
'MSM_20130523000000.csv', 'MSM_20130523030000.csv', 'MSM_20130523060000.csv', 'MSM_20130523090000.csv', 
'MSM_20130523120000.csv', 'MSM_20130523150000.csv', 'MSM_20130523180000.csv', 'MSM_20130523210000.csv', 
'MSM_20130524000000.csv', 'MSM_20130524030000.csv', 'MSM_20130524060000.csv', 'MSM_20130524090000.csv', 
'MSM_20130524120000.csv', 'MSM_20130524150000.csv', 'MSM_20130524180000.csv', 'MSM_20130524210000.csv', 
'MSM_20130525000000.csv', 'MSM_20130525030000.csv', 'MSM_20130525060000.csv', 'MSM_20130525090000.csv', 
'MSM_20130525120000.csv', 'MSM_20130525150000.csv', 'MSM_20130525180000.csv', 'MSM_20130525210000.csv', 
'MSM_20130526000000.csv', 'MSM_20130526030000.csv', 'MSM_20130526060000.csv', 'MSM_20130526090000.csv', 
'MSM_20130526120000.csv', 'MSM_20130526150000.csv', 'MSM_20130526180000.csv', 'MSM_20130526210000.csv', 
'MSM_20130527000000.csv', 'MSM_20130527030000.csv', 'MSM_20130527060000.csv', 'MSM_20130527090000.csv', 
'MSM_20130527120000.csv', 'MSM_20130527150000.csv', 'MSM_20130527180000.csv', 'MSM_20130527210000.csv', 
'MSM_20130528000000.csv', 'MSM_20130528030000.csv', 'MSM_20130528060000.csv', 'MSM_20130528090000.csv', 
'MSM_20130528120000.csv', 'MSM_20130528150000.csv', 'MSM_20130528180000.csv', 'MSM_20130528210000.csv', 
'MSM_20130529000000.csv', 'MSM_20130529030000.csv', 'MSM_20130529060000.csv', 'MSM_20130529090000.csv']

print(files_in_input_directory)
print("\n")

# trying to match any string with 000000, 060000, 120000, 180000
# Question: I use + meaning one or more, and | to indicate the options, but this will match
# 'MSM_20130519210000.csv' as well, and I don't know why
print(list(filter(lambda x: re.search(r'[00|06|12|18]+0000', x), files_in_input_directory)))
print("\n")

# This verbose version works
print(list(filter(lambda x: re.search(r'000000|060000|120000|180000', x), files_in_input_directory)))
print("\n")

If you are trying to match filenames that contain 000000, 060000, 120000 or 180000, then instead of

re.search(r'[00|06|12|18]+0000', x)

use

re.search(r'(00|06|12|18)0000', x)
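To check the corrected pattern against the file names, here is a minimal sketch written for Python 3. `sample_files` is a shortened stand-in for the list in the question, and the variable names are only illustrative.

```python
import re

# A few of the file names from the question, covering matching and
# non-matching time stamps.
sample_files = [
    'MSM_20130519150000.csv',
    'MSM_20130519180000.csv',
    'MSM_20130519210000.csv',
    'MSM_20130520000000.csv',
    'MSM_20130520030000.csv',
    'MSM_20130520060000.csv',
    'MSM_20130520120000.csv',
]

# Parentheses group the alternatives: exactly one of 00, 06, 12, 18,
# followed by 0000.
pattern = re.compile(r'(00|06|12|18)0000')

matching = [name for name in sample_files if pattern.search(name)]
print(matching)
# ['MSM_20130519180000.csv', 'MSM_20130520000000.csv',
#  'MSM_20130520060000.csv', 'MSM_20130520120000.csv']
```

The 21:00 file is no longer returned, because 21 is not one of the listed alternatives.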

Square brackets [...] match only a single character at a time, and the + character means "match 1 or more of the preceding expression".

[00|06|12|18] is a character set that matches any single one of the characters 0, 1, 2, 6, 8 or |. Repeating a character inside the brackets adds nothing, and | is a literal character there, not alternation, so the set is equivalent to writing [01268|]. Thus [00|06|12|18]+0000 will match 210000 in "MSM_20130519210000.csv": the set matches the 2 and the 1, and 0000 follows. Not what you meant, I should think.
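A quick way to convince yourself of this is to compare the original pattern with the shorter character set on the file name that was unexpectedly returned. This is just a sketch for Python 3; the variable names are mine.

```python
import re

# The pattern from the question and the character set it reduces to.
broken = re.compile(r'[00|06|12|18]+0000')
equivalent = re.compile(r'[01268|]+0000')

name = 'MSM_20130519210000.csv'

m = broken.search(name)
print(m.group())                         # 210000
print(equivalent.search(name).group())   # 210000

# Inside brackets, | is an ordinary character, so this matches too.
pipe_match = broken.search('|0000')
print(pipe_match is not None)            # True
```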

Instead of expressing a character set that can match one or more times, make it either a capturing group

r'(00|06|12|18)0000'

or a positive lookbehind expression

r'(?<=00|06|12|18)0000'

They are equivalent for your purposes, since you don't care about the matched text or any groups.
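If you ever do care about the matched text, the two forms differ: the group consumes the two leading digits, while the lookbehind only checks that they are there. A small Python 3 sketch (variable names are mine) shows the difference. Note that Python's `re` module requires lookbehind patterns to have a fixed width, which holds here because every alternative is two characters long.

```python
import re

name = 'MSM_20130520060000.csv'

group_match = re.search(r'(00|06|12|18)0000', name)
look_match = re.search(r'(?<=00|06|12|18)0000', name)

print(group_match.group())    # 060000 (the two digits are part of the match)
print(group_match.group(1))   # 06     (captured by the group)
print(look_match.group())     # 0000   (the lookbehind consumes nothing)

# Both forms reject the 21:00 file.
rejected = 'MSM_20130519210000.csv'
print(re.search(r'(00|06|12|18)0000', rejected))      # None
print(re.search(r'(?<=00|06|12|18)0000', rejected))   # None
```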

The basic problem here is that you were not grouping the patterns, but creating a character set to match against by using [ ... ].

This regex works: ((00)|(06)|(12)|(18))0000
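Since the question asks for names that end in these digits, a tighter variant is possible. The sketch below (Python 3) assumes every file name ends in `.csv`, uses a non-capturing group `(?:...)` because the captured text is not needed, and anchors the pattern with `$`. The second file name is hypothetical and not from the question's list; it is there only to show what an unanchored search would let through.

```python
import re

# Non-capturing group plus an anchor on the assumed '.csv' suffix.
anchored = re.compile(r'(?:00|06|12|18)0000\.csv$')
# The unanchored pattern, for comparison.
loose = re.compile(r'(00|06|12|18)0000')

real_name = 'MSM_20130520060000.csv'
# Hypothetical name: 12 May, 00:00:15. The digits "120000" appear in the
# middle of the time stamp, not at the end.
tricky_name = 'MSM_20130512000015.csv'

print(bool(anchored.search(real_name)))    # True
print(bool(loose.search(tricky_name)))     # True  (false positive)
print(bool(anchored.search(tricky_name)))  # False
```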
