Finding overlapping sequences with regular expressions in Python

I'm trying to extract the numbers in a string together with the characters immediately before and after each number (excluding digits and whitespace). The function should return a list of tuples, each with the shape:

(previous_sequence, number, next_sequence)

For example:

string = '200gr T34S'
my_func(string)
>>[('', '200', 'gr'), ('T', '34', 'S')]

My first attempt was:

import re

def my_func(string):
    return re.findall(r'([^\d\s]+)?(\d+)([^\d\s]+)?', string)

But this function doesn't do what I expect when I pass a string like '2AB3'. I would like it to output [('','2','AB'), ('AB','3','')], but instead it returns [('','2','AB'), ('','3','')], because 'AB' was already consumed by the previous match.

How could I fix this?

Instead of the quantifiers + and ?, you can simply use *:

>>> re.findall(r'([^\d\s]*)(\d+)([^\d\s]*)',string)
[('', '200', 'gr'), ('T', '34', 'S')]

But if you want to match the overlapping strings, you can use a positive lookahead to find all the overlapping matches:

>>> re.findall(r'(?=([^\d\s]*)(\d+)([^\d\s]*))','2AB3')
[('', '2', 'AB'), ('AB', '3', ''), ('B', '3', ''), ('', '3', '')]
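The extra tuples appear because a pattern wrapped entirely in a lookahead consumes nothing, so the engine retries at every position in the string. As a small sketch (the variable names `pattern` and `starts` are only illustrative), `re.finditer` can show where each result begins:

```python
import re

# The whole pattern sits inside a lookahead, so every match is zero-width
# and the scan advances one character at a time.
pattern = re.compile(r'(?=([^\d\s]*)(\d+)([^\d\s]*))')

# Pair each match with the index at which it was found.
starts = [(m.start(), m.groups()) for m in pattern.finditer('2AB3')]
print(starts)
```

The matches starting at indices 2 and 3 are the ones that begin partway through 'AB' or directly on the digit, which is why they show up in addition to the two expected tuples.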

Since the numbers themselves never overlap, a single trailing lookahead
assertion should be all you need.

Something like ([^\d\s]+)?(\d+)(?=([^\d\s]+)?)

Or use ([^\d\s]*)(\d+)(?=([^\d\s]*)) if you care about
the difference between NULL and the empty string.
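As a rough sketch of how the trailing-assertion idea could look in Python: `my_func` is the name from the question, while `my_func_with_none` is a hypothetical helper added only for illustration. The text after each number is captured inside a lookahead, so it is not consumed and stays available as the prefix of the next match.

```python
import re

def my_func(string):
    # The trailing group is inside a lookahead, so it is captured
    # without being consumed by the match.
    return re.findall(r'([^\d\s]*)(\d+)(?=([^\d\s]*))', string)

def my_func_with_none(string):
    # Hypothetical variant: with optional '(...+)?' groups, a group that
    # does not participate is reported as None by Match.groups().
    pattern = r'([^\d\s]+)?(\d+)(?=([^\d\s]+)?)'
    return [m.groups() for m in re.finditer(pattern, string)]

print(my_func('2AB3'))
print(my_func('200gr T34S'))
print(my_func_with_none('2AB3'))
```

Note that `re.findall` reports non-participating groups as empty strings, so in Python the None case only becomes visible when reading the match objects directly, as the second function does.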

Another way is to combine a regex split with a small function:

import re

# Sample inputs: '200gr T34S', '2AB3'

def s(x):
    # Splitting with a capturing group keeps the numbers in the result.
    # Even positions hold text, odd positions hold the separator:
    # the number, or None when the separator was whitespace.
    d = re.split(r'\s+|(\d+)', x)
    d = ['' if v is None else v for v in d]  # replace None with ''
    # Walk the separators by position, so repeated numbers are handled
    # correctly and the neighbours on both sides always exist.
    return [(d[i - 1], d[i], d[i + 1])
            for i in range(1, len(d), 2) if d[i]]

print(s('2AB3'))

This prints:

[('', '2', 'AB'), ('AB', '3', '')]
