
Regex: Match all characters in between an underscore and a period

I have a set of file names from which I need to extract the dates. The file names look like:

['1 120836_1_20210101.csv',
 '1 120836_1_20210108.csv',
 '1 120836_20210101.csv',
 '1 120836_20210108.csv',
 '10 120836_1_20210312.csv',
 '10 120836_20210312.csv',
 '11 120836_1_20210319.csv',
 '11 120836_20210319.csv',
 '12 120836_1_20210326.csv',
 ...
]

As an example, I would need to extract 20210101 from the first item in the list above.

Here is my code, but it is not working - I'm not totally familiar with regex.

import re
dates = []
for file in files:
    dates.extend(re.findall("(?<=_)\d{}(?=\d*\.)", file))

You weren't that far off, but there were a few issues:

  • you extend dates with the result of .findall, but you expect only one match per file name and are building the whole dates list in one pass, so a re.search in a list comprehension is a lot simpler
  • your regex has a few unneeded complications (and some bugs)
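As a rough sketch of the second point (not part of the original answer): Python's re module treats an empty {} as two literal characters rather than as a quantifier, so \d{} means "a digit followed by the text {}". The pattern below is the one from the question, written as a raw string; the second input string is made up purely for illustration.

```python
import re

# The pattern from the question, written as a raw string
original = r"(?<=_)\d{}(?=\d*\.)"
name = "1 120836_1_20210101.csv"

# No digit in the file name is followed by a literal "{}", so nothing is found
no_hits = re.findall(original, name)
print(no_hits)   # []

# Hypothetical input that does contain "{}" right after a digit
odd_hits = re.findall(original, "x_7{}.csv")
print(odd_hits)  # ['7{}']
```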

This is what you were after:

import re

files = [
    '1 120836_1_20210101.csv',
    '1 120836_1_20210108.csv',
    '1 120836_20210101.csv',
    '1 120836_20210108.csv',
    '10 120836_1_20210312.csv',
    '10 120836_20210312.csv',
    '11 120836_1_20210319.csv',
    '11 120836_20210319.csv',
    '12 120836_1_20210326.csv'
]

dates = [re.search(r"(?<=_)\d+(?=\.)", fn).group(0) for fn in files]

print(dates)

Output:

['20210101', '20210108', '20210101', '20210108', '20210312', '20210312', '20210319', '20210319', '20210326']

It keeps the lookbehind for an underscore, and changes the lookahead to look for a period. It just matches all digits (at least one, because of the +) in between the two.
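To see what each lookaround contributes, here is a minimal sketch on one of the file names. It also wraps the search in a helper, extract_date, which is a hypothetical name introduced here for illustration. The helper returns None instead of raising when a file name has no match, on the assumption that such names might turn up in a real directory.

```python
import re

name = "1 120836_1_20210101.csv"

# Lookbehind only: every digit run that directly follows an underscore
after_underscore = re.findall(r"(?<=_)\d+", name)
print(after_underscore)  # ['1', '20210101']

# Adding the lookahead keeps only the run that is directly followed by a period
before_period = re.findall(r"(?<=_)\d+(?=\.)", name)
print(before_period)     # ['20210101']


def extract_date(filename):
    """Return the digit run that follows an underscore and precedes a period, or None."""
    match = re.search(r"(?<=_)\d+(?=\.)", filename)
    return match.group(0) if match else None


print(extract_date(name))         # 20210101
print(extract_date("notes.csv"))  # None
```

Calling .group(0) directly on the result of re.search, as in the list comprehension above, raises an AttributeError when there is no match, which is fine as long as every file name follows the expected pattern.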

Note that the r in front of the string avoids having to double up the backslashes in the regex; the backslashes in \d and \. are still required to indicate a digit and a literal period.
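A short sketch of the raw-string point, comparing the raw pattern with the same pattern written as an ordinary string with doubled backslashes. The variable names are made up for illustration.

```python
import re

pattern_raw = r"(?<=_)\d+(?=\.)"        # raw string: backslashes written once
pattern_escaped = "(?<=_)\\d+(?=\\.)"   # ordinary string: backslashes doubled

# Both spellings produce exactly the same pattern text
same_pattern = pattern_raw == pattern_escaped
print(same_pattern)  # True

# Either way, the regex engine still sees \d (a digit) and \. (a literal period)
found = re.search(pattern_escaped, "10 120836_20210312.csv").group(0)
print(found)         # 20210312
```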
