I'd like to match (and replace with a custom replacement function) each block of consecutive lines that all start with foo. This nearly works:
import re
s = """bar6387
bar63287
foo1234
foohelloworld
fooloremipsum
baz
bar
foo236
foo5382
bar
foo879"""
def f(m):
    print(m)
s = re.sub('(foo.*\n)+', f, s)
print(s)
# <re.Match object; span=(17, 53), match='foo1234\nfoohelloworld\nfooloremipsum\n'>
# <re.Match object; span=(61, 76), match='foo236\nfoo5382\n'>
but it fails to recognize the last block, obviously because it is the last line and there is no \n at the end.
Is there a cleaner way to match a block of one or more consecutive lines starting with the same pattern foo?
Here is an re.findall approach:
s = """bar6387
bar63287
foo1234
foohelloworld
fooloremipsum
baz
bar
foo236
foo5382
bar
foo879"""
lines = re.findall(r'^foo.*(?:\nfoo.*(?=\n|$))*', s, flags=re.M)
print(lines)
# ['foo1234\nfoohelloworld\nfooloremipsum',
#  'foo236\nfoo5382',
#  'foo879']
The above regex runs in multiline mode, and says to match:
^                       from the start of a line
foo                     the literal text "foo"
.*                      consume the rest of the line
(?:\nfoo.*(?=\n|$))*    match a newline and another "foo" line, 0 or more times
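As a minimal sketch, the breakdown above can be written directly into the pattern using verbose mode (re.X), which ignores whitespace and allows comments inside the regex. The names BLOCK, text and blocks, and the short sample input, are assumptions made for illustration only.

```python
import re

# Same pattern as above, spelled out with re.X so each piece is commented.
BLOCK = re.compile(r"""
    ^foo.*            # a line starting with "foo", up to the end of that line
    (?:               # then, zero or more times:
        \nfoo.*       #   a newline followed by another "foo" line
        (?=\n|$)      #   which must end at a newline or at the end of the text
    )*
""", flags=re.M | re.X)

# Hypothetical sample input; "barfoo" does not start with "foo".
text = "bar\nfoo1\nfoo2\nbarfoo\nfoo3"
blocks = [m.group() for m in BLOCK.finditer(text)]
print(blocks)  # ['foo1\nfoo2', 'foo3']
```

Because of the ^ anchor in multiline mode, a line such as barfoo is not picked up even though it contains foo.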
Edit:
If you need to replace/remove these blocks, then use the same pattern with re.sub and a lambda callback:
output = re.sub(r'^foo.*(?:\nfoo.*(?=\n|$))*', lambda m: "BLAH", s, flags=re.M)
print(output)
This prints:
bar6387
bar63287
BLAH
baz
bar
BLAH
bar
BLAH
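Since the question asks for a custom replacement function, here is a small sketch of a callback that actually inspects the matched block instead of returning a constant. The function name summarize and the replacement format are hypothetical; only the pattern comes from this answer.

```python
import re

def summarize(m):
    # m.group() is the whole block; split it back into its lines.
    lines = m.group().split("\n")
    # Hypothetical replacement text: report how many lines the block had.
    return f"<{len(lines)} foo line(s)>"

# Hypothetical sample input, shorter than the one in the question.
text = "bar\nfoo1\nfoo2\nbaz\nfoo3"
result = re.sub(r'^foo.*(?:\nfoo.*(?=\n|$))*', summarize, text, flags=re.M)
print(result)
# bar
# <2 foo line(s)>
# baz
# <1 foo line(s)>
```

Whatever string the callback returns is substituted for the entire block, so the surrounding newlines are left untouched.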
Do you really need a regex? Here is an itertools.groupby-based approach:
from itertools import groupby
import re
# dummy example function
f = lambda x: '>>'+x.upper()+'<<'
out = '\n'.join(f(G) if (G := '\n'.join(g)) and k else G
                for k, g in groupby(s.split('\n'), lambda l: l.startswith('foo')))
print(out)
NB. You don't need a regex, but you can also use one in groupby to define the matching lines if needed:
# using a regex to match the blocks:
out = '\n'.join(f(G) if (G := '\n'.join(g)) and k else G
                for k, g in groupby(s.split('\n'),
                                    lambda l: bool(re.match('foo', l))))
Output:
bar6387
bar63287
>>FOO1234
FOOHELLOWORLD
FOOLOREMIPSUM<<
baz
bar
>>FOO236
FOO5382<<
bar
>>FOO879<<
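The same idea can be unrolled into a small helper without the walrus operator, which may be easier to read and reuse. This is only a sketch: replace_blocks, is_match and func are hypothetical names, and the sample input is made up for illustration.

```python
from itertools import groupby

def replace_blocks(text, is_match, func):
    """Apply func to each run of consecutive lines for which is_match is true."""
    out = []
    # groupby yields one group per run of adjacent lines sharing the same key.
    for matched, group in groupby(text.split("\n"), key=is_match):
        block = "\n".join(group)
        out.append(func(block) if matched else block)
    return "\n".join(out)

# Hypothetical sample input; "barfoo" does not start with "foo".
text = "bar\nfoo1\nfoo2\nbarfoo\nfoo3"
result = replace_blocks(text, lambda line: line.startswith("foo"), str.upper)
print(result)
# bar
# FOO1
# FOO2
# barfoo
# FOO3
```

Passing the predicate as an argument keeps the choice between str.startswith and a regex test out of the grouping logic.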
You can use either of these equivalent calls:
re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)
re.sub(r'^foo.*(?:\nfoo.*)*', f, s, flags=re.M)
where
^ - matches the start of the string (here, the start of any line, due to the (?m) or re.M option)
foo - matches foo
.* - matches any zero or more chars other than line break chars, as many as possible
(?:\nfoo.*)* - matches zero or more sequences of a newline, foo, and then the rest of the line.
See the Python demo:
import re
s = "bar6387\nbar63287\nfoo1234\nfoohelloworld\nfooloremipsum\nbaz\nbar\nfoo236\nfoo5382\nbar\nfoo879"
def f(m):
    print(m.group().replace('\n', r'\n'))
re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)
Output:
foo1234\nfoohelloworld\nfooloremipsum
foo236\nfoo5382
foo879
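The demo callback only prints and returns None, so the result of re.sub is not meaningful there. As a rough sketch, the variant below returns a replacement string and also checks that the inline (?m) form and the flags=re.M form give the same result. The bracket-and-pipe output format is a hypothetical choice for illustration.

```python
import re

s = "bar6387\nbar63287\nfoo1234\nfoohelloworld\nfooloremipsum\nbaz\nbar\nfoo236\nfoo5382\nbar\nfoo879"

def f(m):
    # Hypothetical replacement: collapse the block onto one bracketed line.
    return "[" + m.group().replace("\n", "|") + "]"

inline = re.sub(r'(?m)^foo.*(?:\nfoo.*)*', f, s)
flagged = re.sub(r'^foo.*(?:\nfoo.*)*', f, s, flags=re.M)

print(inline == flagged)  # True
print(inline)
# bar6387
# bar63287
# [foo1234|foohelloworld|fooloremipsum]
# baz
# bar
# [foo236|foo5382]
# bar
# [foo879]
```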