
Use regex to extract multiple strings following a certain pattern

I have a long string like the one below, and I want to extract every item listed after Invalid items. I expect the regex to return a list like ['abc.def.com', 'bar123', 'hello', 'world', '1212', '5566', 'aaaa']

I tried the pattern below, but it returns one string per parenthesized group instead of one string per item:

import re
test = 'Valid items: (aaa.com; bbb.com); Invalid items: (abc.def.com;); Valid items: (foo123;); Invalid items: (bar123;); Valid items: (1234; 5678; abcd;); Invalid items: (hello; world; 1212; 5566; aaaa;)'
re.findall(r'Invalid items: \((.+?);\)', test)
# ['abc.def.com', 'bar123', 'hello; world; 1212; 5566; aaaa']

Is there a better way to do this with regex?

thanks

If you want a single findall call to return every item individually, you need a positive lookbehind, e.g. (?<=foo), and in this case the lookbehind has to be variable-width. Unfortunately, Python's built-in re module only supports fixed-width lookbehind. However, if you're willing to use the outstanding third-party regex module, it can be done.

Regex:

(?<=Invalid items: \([^)]*)[^ ;)]+

Demonstration: https://regex101.com/r/p90Z81/1
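As a rough illustration of the point above, the sketch below first shows the built-in re module rejecting the variable-width lookbehind, then runs the same pattern through the regex module if it is available. It assumes regex has been installed separately (for example with pip install regex); the variable names re_error and items are only illustrative.

```python
import re

test = ('Valid items: (aaa.com; bbb.com); Invalid items: (abc.def.com;); '
        'Valid items: (foo123;); Invalid items: (bar123;); '
        'Valid items: (1234; 5678; abcd;); '
        'Invalid items: (hello; world; 1212; 5566; aaaa;)')

pattern = r'(?<=Invalid items: \([^)]*)[^ ;)]+'

# The standard-library engine refuses to compile a variable-width lookbehind.
try:
    re.compile(pattern)
    re_error = None
except re.error as exc:
    re_error = str(exc)
print(re_error)

# The third-party regex module accepts the same pattern unchanged.
try:
    import regex
except ImportError:
    regex = None  # not installed, so the second half is skipped

items = regex.findall(pattern, test) if regex is not None else None
print(items)
```

When regex is not installed, items is simply None; otherwise it should hold the individual entries of every Invalid items group.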

If there can be empty items, a small modification to the regex allows capture of these zero-width matches, as follows:

(?<=Invalid items: \([^)]*)(?:[^ ;)]+|(?<=\(| ))

Using re, you can capture each group and then split it on a semicolon followed by a space:

import re
test = 'Valid items: (aaa.com; bbb.com); Invalid items: (abc.def.com;); Valid items: (foo123;); Invalid items: (bar123;); Valid items: (1234; 5678; abcd;); Invalid items: (hello; world; 1212; 5566; aaaa;)'
results = []
# Each captured group holds the items of one "Invalid items" list, e.g. 'hello; world; 1212; 5566; aaaa'.
for s in re.findall(r'Invalid items: \((.+?);\)', test):
    # Split the group into individual items on the "; " separator.
    results.extend(s.split("; "))

print(results)

Output

['abc.def.com', 'bar123', 'hello', 'world', '1212', '5566', 'aaaa']

See a Python demo .
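The split-based approach above can also be wrapped in a small helper that tolerates both "a; b;" and "a;b;" spacing. This is only a sketch: the function name extract_invalid_items is hypothetical, and it assumes that items never contain a closing parenthesis or a semicolon.

```python
import re

def extract_invalid_items(text):
    """Return every item listed inside an 'Invalid items: (...)' group."""
    items = []
    # Capture everything between the parentheses of each Invalid items group.
    for group in re.findall(r'Invalid items: \(([^)]*)\)', text):
        # Split on ';', trim whitespace, and drop the empty piece left by
        # the trailing semicolon.
        items.extend(part.strip() for part in group.split(';') if part.strip())
    return items

test = ('Valid items: (aaa.com; bbb.com); Invalid items: (abc.def.com;); '
        'Valid items: (foo123;); Invalid items: (bar123;); '
        'Valid items: (1234; 5678; abcd;); '
        'Invalid items: (hello; world; 1212; 5566; aaaa;)')
print(extract_invalid_items(test))
```

Stripping each piece rather than splitting on a fixed "; " string means the helper does not depend on how many spaces follow each semicolon.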

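This picks out only the items from the groups whose label appears in the pattern (Valid or Invalid). The items in this input are separated by a bare semicolon, so the split allows optional whitespace around it:

import re
test = 'Valid items: (abc.h; bac.h); Invalid items: (aaa.123;); Valid items: (aaa H;bbbb H;); Invalid items: (abc;bac;)'
results = []
for s in re.findall(r'Invalid items: \((.+?);\)', test):
    # Split on ';' with optional surrounding whitespace; splitting on " ; " would leave 'abc;bac' intact.
    results.extend(re.split(r"\s*;\s*", s))

print(results)
# ['aaa.123', 'abc', 'bac']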
