
Regex find match within number range

I have a series of files with the following naming convention: "2020.01.01 W1 Forecast.xlsm". I am trying to loop through a directory and pick out the files whose names match the year 2020 or later, or at least a broader range (i.e. 2020-2030), so that I don't have to alter my script every year. I've tried the following, but I have not been able to get the pattern to match anything other than the current year, 2020. The naming convention starts with the year string.

from pathlib import Path

path_str = '/Users/X/Desktop/Test_Directory/'

pattern_str = '*2020.*Forecast.xlsm'

p = Path(path_str)
files = p.rglob(pattern_str)

for file in files:
    print(file)

Sample output:

/Users/X/Desktop/Test_Directory/2020.08.03 Week 32 Forecast.xlsm
/Users/X/Desktop/Test_Directory/2020.01.06 Week 2 Forecast.xlsm
/Users/X/Desktop/Test_Directory/2020.06.18 Week 25 Forecast.xlsm
/Users/X/Desktop/Test_Directory/2020.06.22 Week 26 Forecast.xlsm

Any help or direction is greatly appreciated.

Here's what you're looking for: '^20[2-9][0-9].+(\.xlsm)$'

It says: start with a year from 2020 through 2099, followed by any character (.) one or more times (+), and end with .xlsm, written (\.xlsm)$. Note the backslash in the last part. It is required to escape the period; otherwise the period would be interpreted as "any character".
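A quick way to check this pattern is to run it against a few file names. The sketch below assumes the pattern is applied to the bare file name rather than the full path; the entries in `names` are made up for illustration and only follow the naming convention from the question.

```python
import re

# Pattern from the answer above: a year 2020-2099 at the start, ".xlsm" at the end.
pattern = re.compile(r'^20[2-9][0-9].+(\.xlsm)$')

# Hypothetical file names, for illustration only.
names = [
    "2020.08.03 Week 32 Forecast.xlsm",
    "2031.01.06 Week 2 Forecast.xlsm",
    "2019.12.30 Week 1 Forecast.xlsm",   # year too early
    "2020.08.03 Week 32 Forecast.xlsx",  # wrong extension
]

matched = [name for name in names if pattern.search(name)]
for name in matched:
    print(name)
```

Only the first two names should be printed; the 2019 file fails on the year and the .xlsx file fails on the escaped extension.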

I am not sure how far you want to go, but if your goal is only to identify a year in the range 2020-2030, then this is a regular expression for your complete path: ^.*20(2\d|30).*$

Because you are working with a path, I would suggest splitting the string on the last slash / and applying the regular expression to the last item of the resulting list. Then you can write your regular expression for the file name alone.

Maybe this will help:

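import re
for file in files:
    # file is a pathlib.Path; .name is the part after the last slash
    my_string = file.name
    # re.search (re has no find function); the extension is .xlsm
    match = re.search(r'^20(2\d|30).*\.xlsm$', my_string)
    if match:
        print(file)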

Maybe try it yourself with this tool.
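For a version that can be run without the original directory, here is a minimal sketch that builds a temporary folder, globs for the forecast files, and then filters on the file name with the 2020-2030 pattern. The file names and the temporary directory are assumptions made up for the example; only the naming convention comes from the question.

```python
import re
import tempfile
from pathlib import Path

# Years 2020-2030 at the start of the name, ".xlsm" at the end.
year_pattern = re.compile(r'^20(2\d|30).*\.xlsm$')

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical files, created only so the sketch can run anywhere.
    for name in ["2020.01.06 Week 2 Forecast.xlsm",
                 "2030.06.18 Week 25 Forecast.xlsm",
                 "2019.12.30 Week 1 Forecast.xlsm",
                 "2031.01.06 Week 2 Forecast.xlsm"]:
        (root / name).touch()

    # Let the glob do the coarse selection, then refine with the regex
    # applied to the file name only (not the full path).
    selected = sorted(f.name for f in root.rglob('*Forecast.xlsm')
                      if year_pattern.search(f.name))

print(selected)
```

Keeping the glob broad and doing the year check in a regex avoids having to express a numeric range in glob syntax.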

I also want to add some more information about regular expressions, so you can understand what is going on.

  1. ^ - This matches the start of the string. It is probably why some of the answers so far did not work: a full path starts with the directory, not with the year.

  2. . - This matches any character, so you can easily skip over uninteresting parts. But be careful: because of this, a literal dot has to be written as \.

  3. $ - This matches the end of the string.

  4. \d - This is shorthand for a digit and matches [0-9].

  5. * - This is a greedy quantifier. It tries to match the preceding item as many times as possible, starting from zero. Examples:

    a. .* - This tries to match as many characters as possible, of any kind.

    b. \d* - This tries to match as many digits as possible.

  6. + - This is also a greedy quantifier, but it has to match at least once.
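The points in the list above can be checked with a few one-liners. This is only an illustrative sketch: the sample name follows the convention from the question, and the misspelled name "ForecastXxlsm" is made up to show the effect of the unescaped dot.

```python
import re

name = "2020.01.01 W1 Forecast.xlsm"

# ^ anchors at the start: the year must be the first thing in the string.
starts_with_year = bool(re.search(r'^2020', name))
year_in_path = bool(re.search(r'^2020', "/Users/X/Desktop/" + name))

# An unescaped . matches any character; \. matches only a literal dot.
loose = bool(re.search(r'Forecast.xlsm$', "2020.01.01 W1 ForecastXxlsm"))
strict = bool(re.search(r'Forecast\.xlsm$', "2020.01.01 W1 ForecastXxlsm"))

# * may match zero characters, + needs at least one.
star = bool(re.search(r'^\d*$', ""))
plus = bool(re.search(r'^\d+$', ""))

# Greedy: \d* takes as many digits as it can before the first dot.
digits = re.match(r'\d*', "2020.01.01").group()

print(starts_with_year, year_in_path, loose, strict, star, plus, digits)
```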

In your second pattern, you're missing a . wildcard after the year. You probably want

^(202[0-9]|2030).*Forecast\.xlsm

rather than

^(202[0-9]|2030)*Forecast.xlsm
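To see the difference between the two patterns, the sketch below runs both against a name that follows the question's convention. Without the dot, the * applies to the year group itself, so "Forecast" would have to come directly after the year. The name "2020Forecast.xlsm" is a made-up example that shows what the original pattern does accept.

```python
import re

name = "2020.01.01 W1 Forecast.xlsm"

# Without the dot, * repeats the year group, so "Forecast" must follow
# the year (or the start of the string) directly.
without_dot = re.compile(r'^(202[0-9]|2030)*Forecast.xlsm')
# With .* any characters may sit between the year and "Forecast".
with_dot = re.compile(r'^(202[0-9]|2030).*Forecast\.xlsm')

original_matches = bool(without_dot.search(name))
fixed_matches = bool(with_dot.search(name))
original_matches_glued = bool(without_dot.search("2020Forecast.xlsm"))

print(original_matches, fixed_matches, original_matches_glued)
```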

You can use a site like https://regexr.com/ to experiment with regexes.

But you might want to consider selecting the most recent files with programming logic instead of regexes: you could parse the file name and select a date range, e.g. using datetime.


Update

Starting from your updated code:

import datetime
from pathlib import Path

path_str = '/Users/X/Desktop/Test_Directory/'
pattern_str = '*Forecast.xlsm'  # All your report files

p = Path(path_str)
files = p.rglob(pattern_str)

# Compute the cutoff year once, outside the loop
current_year = datetime.datetime.today().year

for file in files:
    # # uncomment (and import re) in case there are different patterns in that folder:
    # if not re.match(r"\d{4}\.\d{2}\.\d{2}.*", file.name): continue
    date = datetime.datetime.strptime(file.name[:10], "%Y.%m.%d")
    # >= so that a file dated exactly January 1 is included
    if date >= datetime.datetime(current_year, 1, 1):
        print(date)

This will filter your list of files down to those dated from the start of the current year onward.
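If a fixed range such as 2020-2030 is preferred over "current year onward", the same date parsing can be wrapped in a small helper. This is a sketch under assumptions: `in_year_range` and its `first_year`/`last_year` parameters are hypothetical names introduced here, and the sample file names are made up.

```python
import datetime

def in_year_range(file_name, first_year=2020, last_year=2030):
    """Return True if the name starts with a YYYY.MM.DD date in the given range."""
    try:
        date = datetime.datetime.strptime(file_name[:10], "%Y.%m.%d")
    except ValueError:
        # The name does not start with a valid date, so skip it.
        return False
    return first_year <= date.year <= last_year

# Hypothetical file names, for illustration only.
names = [
    "2020.08.03 Week 32 Forecast.xlsm",
    "2030.12.30 Week 53 Forecast.xlsm",
    "2031.01.06 Week 2 Forecast.xlsm",
    "Notes Forecast.xlsm",
]

kept = [name for name in names if in_year_range(name)]
print(kept)
```

Because strptime validates the whole date, names that merely start with four digits but carry an impossible month or day are rejected as well.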

