Get a list of substrings from a list of strings where the substrings match a certain regular expression

This question is for Python 3.6+ (but feel free to answer for lower Python versions for other readers).

I want to extract a substring from each string that matches a regular expression.

Say I have the following:

a = ['v-01-001', 'v-01-002', 'v-02-001', 'v-02-002', 'v-02-003', 'v-03-001']

I want the last 3 digits of all strings matching v-02-\d\d\d, i.e.:

['001', '002', '003']

My naive attempt:

[x[1] for x in list(map(lambda i: re.search(r'v-02-(\d\d\d)', i), a)) if x]

Can you come up with anything more elegant?

Thanks

You could do something like this:

import re

a = ['v-01-001', 'v-01-002', 'v-02-001', 'v-02-002', 'v-02-003', 'v-03-001']
pattern = re.compile(r'v-02-(\d{3})$')  # raw string, so \d is not treated as a string escape
print([m.group(1) for m in map(pattern.match, a) if m])

Output

['001', '002', '003']

You could also use finditer:

print([m.group(1) for ms in map(pattern.finditer, a) for m in ms])

Output

['001', '002', '003']
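
The pattern above relies on match plus a trailing $ to pin both ends of the string. As a rough sketch of how the anchoring differs between search, match, and fullmatch, the example below drops the $ and adds two hypothetical inputs ('xv-02-003' and 'v-02-0045') that are not in the question:

```python
import re

# Sample data: three strings from the question plus two hypothetical
# edge cases added only to show how the methods differ.
a = ['v-01-001', 'v-02-001', 'v-02-002', 'xv-02-003', 'v-02-0045']
pattern = re.compile(r'v-02-(\d{3})')  # no ^ or $ in the pattern itself

# search() finds the pattern anywhere in the string.
by_search = [m.group(1) for m in map(pattern.search, a) if m]
# match() anchors only at the start of the string.
by_match = [m.group(1) for m in map(pattern.match, a) if m]
# fullmatch() requires the whole string to match.
by_fullmatch = [m.group(1) for m in map(pattern.fullmatch, a) if m]

print(by_search)     # ['001', '002', '003', '004']
print(by_match)      # ['001', '002', '004']
print(by_fullmatch)  # ['001', '002']
```

If the strings are always well formed, as in the question, all three give the same result; fullmatch is probably the safest choice when they might not be.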

Four ways to do this.

The first is just a regular ol' loop:

li=[]
for s in a:
    m = re.search(r'v-02-(\d\d\d)', s)
    if m:
        li.append(m.group(1))
# li == ['001', '002', '003']

The second makes two calls to the same regex in a list comprehension:

>>> [re.search(r'v-02-(\d\d\d)', s).group(1) for s in a if re.search(r'v-02-(\d\d\d)', s)]
['001', '002', '003']

The third is to use map:

>>> [m.group(1) for m in map(lambda s: re.search(r'v-02-(\d\d\d)', s), a) if m]
['001', '002', '003']

Finally, you can flatten the list with .join and then use findall:

>>> re.findall(r'\bv-02-(\d\d\d)\b', '\t'.join(a))
['001', '002', '003']

Or join with \n and use re.M with the ^ and $ anchors instead of the two \b:

>>> re.findall(r'^v-02-(\d\d\d)$', '\n'.join(a), flags=re.M)
['001', '002', '003']
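
One caveat with the joined approaches, sketched here with hypothetical inputs that are not in the question: \b only checks for a word boundary, and a hyphen counts as one, so strings with extra hyphenated parts still match. The ^ and $ anchors with re.M require each line to match as a whole. This assumes the separator never occurs inside the strings themselves.

```python
import re

# Hypothetical inputs: only the first has exactly the wanted shape.
a = ['v-02-001', 'x-v-02-002', 'v-02-003-x']

# \b is satisfied next to a hyphen, so all three strings match.
with_boundaries = re.findall(r'\bv-02-(\d\d\d)\b', '\t'.join(a))

# ^ and $ with re.M anchor to whole lines, so only the first matches.
with_anchors = re.findall(r'^v-02-(\d\d\d)$', '\n'.join(a), flags=re.M)

print(with_boundaries)  # ['001', '002', '003']
print(with_anchors)     # ['001']
```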

I would probably write these in that same order if I were writing this bit of code.

What is considered more elegant is in the eye of the beholder, I suppose. I would consider the last one to be the most elegant.


You can also skip the regex and use Python's string methods:

>>> prefix='v-02-'
>>> [e[len(prefix):] for e in filter(lambda s: s.startswith(prefix),a)]
['001', '002', '003']

That would likely be the fastest, if speed matters in this case.
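
Note that startswith alone does not check that the rest of the string is exactly three digits. If that matters, a stricter variant might look like the sketch below. The helper name suffixes and its width parameter are made up for illustration, and str.isdecimal is used as an approximation of what \d accepts.

```python
a = ['v-01-001', 'v-01-002', 'v-02-001', 'v-02-002', 'v-02-003', 'v-03-001']
prefix = 'v-02-'

def suffixes(strings, prefix, width=3):
    """Return the tail of each string that is prefix + exactly `width` digits."""
    result = []
    for s in strings:
        if not s.startswith(prefix):
            continue
        tail = s[len(prefix):]
        # Keep the tail only if it has the right length and is all digits.
        if len(tail) == width and tail.isdecimal():
            result.append(tail)
    return result

print(suffixes(a, prefix))  # ['001', '002', '003']
# Hypothetical malformed inputs are filtered out.
print(suffixes(['v-02-abc', 'v-02-0045', 'v-02-007'], prefix))  # ['007']
```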


Python 3.8 (released in October 2019) offers a more elegant alternative. As defined in PEP 572, you can use an assignment expression (the := operator) to assign the match and test it in one step:

[m.group(1) for s in a if (m:=re.search(r'v-02-(\d\d\d)', s))]
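
For completeness, here is the same idea as a self-contained sketch that runs on Python 3.8 or later. The names PATTERN and last_three are illustrative only, and precompiling the pattern is an optional choice rather than part of PEP 572.

```python
import re

# Compile once so the pattern is not re-parsed for every string.
PATTERN = re.compile(r'v-02-(\d\d\d)')

def last_three(strings):
    """Collect group 1 from every string where the pattern is found."""
    # (m := ...) binds the match object and tests it in the same expression.
    return [m.group(1) for s in strings if (m := PATTERN.search(s))]

a = ['v-01-001', 'v-01-002', 'v-02-001', 'v-02-002', 'v-02-003', 'v-03-001']
print(last_three(a))  # ['001', '002', '003']
```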

