Finding overlapping sequence with regular expressions with Python
I'm trying to extract the numbers in a string together with the characters immediately before and after each one (excluding digits and whitespace). The expected return of the function is a list of tuples, with each tuple having the shape:
(previous_sequence, number, next_sequence)
For example:
string = '200gr T34S'
my_func(string)
>>[('', '200', 'gr'), ('T', '34', 'S')]
My first iteration was to use:
import re
def my_func(string):
    return re.findall(r'([^\d\s]+)?(\d+)([^\d\s]+)?', string)
But this function doesn't do what I expect when I pass a string like '2AB3'. I would like it to output [('','2','AB'), ('AB','3','')], but instead it returns [('','2','AB'), ('','3','')], because 'AB' has already been consumed by the previous match.
How could I fix this?
Instead of the quantifiers + and ?, you can simply use *:
>>> re.findall(r'([^\d\s]*)(\d+)([^\d\s]*)',string)
[('', '200', 'gr'), ('T', '34', 'S')]
But if you mean to match the overlapping strings, you can use a positive lookahead to find all the overlapping matches:
>>> re.findall(r'(?=([^\d\s]*)(\d+)([^\d\s]*))','2AB3')
[('', '2', 'AB'), ('AB', '3', ''), ('B', '3', ''), ('', '3', '')]
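Because the lookahead is tried at every position, the result above also contains the partial prefixes ('B', '3', '') and ('', '3', ''). One possible way to drop them, sketched here under the assumption that only matches starting at the beginning of a run are wanted, is to add two negative lookbehinds. The function name `overlapping_triples` is hypothetical and not part of the original answer.

```python
import re

# Hypothetical helper: the overlapping lookahead pattern from the answer,
# restricted so a match may only start at the beginning of a run.
#   (?<![^\d\s])  the start position must not be in the middle of a text run
#   (?<!\d)       the number must not start in the middle of a digit run
PATTERN = re.compile(r'(?<![^\d\s])(?=([^\d\s]*)(?<!\d)(\d+)([^\d\s]*))')

def overlapping_triples(text):
    """Return (previous, number, next) tuples, sharing text between neighbours."""
    return PATTERN.findall(text)

print(overlapping_triples('2AB3'))
print(overlapping_triples('200gr T34S'))
```

For '2AB3' this should leave only the two tuples asked for in the question, since the start positions inside 'AB' are rejected by the first lookbehind.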
Since the numbers themselves never overlap, a single trailing lookahead assertion should be all you need.
Something like
([^\d\s]+)?(\d+)(?=([^\d\s]+)?)
or this
([^\d\s]*)(\d+)(?=([^\d\s]*))
if you care about the difference between NULL (None in Python) and the empty string. With *, both outer groups always take part in the match, so a missing prefix or suffix comes back as an empty string rather than NULL.
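As a minimal sketch of the trailing-assertion idea: the prefix and the number are consumed, while the suffix is only captured inside a lookahead, so it stays available as the prefix of the next match. The names `neighbours` and `neighbours_or_none` are made up for illustration. Note that `re.findall` reports unmatched groups as empty strings, so the NULL case only shows up when inspecting match objects, for example via `re.finditer`.

```python
import re

# Suffix captured inside a lookahead, so it is not consumed.
STAR = re.compile(r'([^\d\s]*)(\d+)(?=([^\d\s]*))')
OPTIONAL = re.compile(r'([^\d\s]+)?(\d+)(?=([^\d\s]+)?)')

def neighbours(text):
    """(previous, number, next) tuples; missing parts are empty strings."""
    return STAR.findall(text)

def neighbours_or_none(text):
    """Same idea, but missing parts are None (groups that did not participate)."""
    return [m.groups() for m in OPTIONAL.finditer(text)]

print(neighbours('2AB3'))          # [('', '2', 'AB'), ('AB', '3', '')]
print(neighbours_or_none('2AB3'))  # [(None, '2', 'AB'), ('AB', '3', None)]
```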
Another way is to combine a regex split with a small helper function:
import re

# Example inputs: '200gr T34S', '2AB3'
def s(x):
    tmp = []
    # The capturing group keeps each number in the result; a split on
    # whitespace contributes None for that group instead.
    d = re.split(r'\s+|(\d+)', x)
    d = ['' if v is None else v for v in d]  # replace None with ''
    # With one capturing group, odd indices hold the captures and even
    # indices hold the text between separators, so both neighbours exist.
    for idx in range(1, len(d), 2):
        if d[idx]:  # skip the '' left behind by a whitespace split
            tmp.append((d[idx - 1], d[idx], d[idx + 1]))
    return tmp

print(s('2AB3'))
Prints:
[('', '2', 'AB'), ('AB', '3', '')]