Python regex: extract list elements that each match multiple patterns

I am totally new to Python. Suppose I have a list as follows.

somelist = [
    'AAAA  1234   SD OXD',
    'AAAB  2342   DF BDD',
    'ERTE  3454   RE DFD',
    'GWED  1234   SD TCD',
    'AAAA  2353   SD MKX',
    'VERD  1234   IO ERT',
]

I would like to extract the elements that have '1234' at positions 7-10 and 'SD' at positions 14-15, counting from 1 (just an example; it could be any combination of positions, with anything in between). The result would be as follows.

['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']

What I am doing now is nesting one filter() call inside another.

import re
x = list(filter(lambda s: re.match('1234', s[6:10]), filter(lambda r: re.match('SD', r[13:15]), somelist)))

This works but looks rather clunky and dumb. Can someone suggest a solution that is more elegant and faster? The list could contain millions of elements (from lines in a file).

There are many discussions about matching any one of several patterns/regexes (match A OR B). This question is about matching A AND B, which must be as common a problem as the OR case. It is clearly going to get messy if I want to match A and B and C and ... at different locations.

Update: Thank you all. My original question was probably not clear enough. It's basically an 'element must match ALL of several patterns at given positions' question.

Inspired by Kcorlidy's response in particular, I gave it a few quick shots and these worked (and . does mean 'any character' except \n, according to the manual):

To match '1234' and 'SD' at said positions:

filter(lambda x: re.search(r'.{6}1234.{3}SD', x), somelist)

To match 'AAAA' and 'SD' at 0:4 and 13:15, respectively:

filter(lambda x: re.search(r'.{0}AAAA.{9}SD', x), somelist)

The take-home message is that the number in the curly braces is a distance (a number of characters) from the end of the previous pattern, or from the start of the match for the first pattern; it is not the start position of the pattern concerned. That is the whole key point. Note that re.search looks for a match at any offset, so to pin the first distance to the beginning of the string, use re.match or prefix the pattern with ^. Simple stuff, which is probably why more people are interested in the match A or B problem than in this match A and B problem.
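The "distance from the end of the previous pattern" rule can be turned into a small helper that computes the gaps for you. This is only a sketch: build_pattern and its (start, literal) input format are hypothetical names made up for illustration, the offsets are 0-based, and it assumes the fields do not overlap.

```python
import re

def build_pattern(specs):
    """Compile a regex from (start, literal) pairs with 0-based starts.

    Hypothetical helper: each gap is measured from the end of the
    previous literal, as described above.
    """
    parts = []
    cursor = 0  # index just past the previous literal
    for start, literal in sorted(specs):
        gap = start - cursor
        if gap < 0:
            raise ValueError("fields overlap")
        parts.append(".{%d}" % gap)        # characters to skip
        parts.append(re.escape(literal))   # literal text, escaped
        cursor = start + len(literal)
    return re.compile("".join(parts))

somelist = [
    'AAAA  1234   SD OXD',
    'AAAB  2342   DF BDD',
    'ERTE  3454   RE DFD',
    'GWED  1234   SD TCD',
    'AAAA  2353   SD MKX',
    'VERD  1234   IO ERT',
]

pattern = build_pattern([(6, "1234"), (13, "SD")])
print(pattern.pattern)  # .{6}1234.{3}SD

# pattern.match anchors at index 0, so the offsets are real positions.
matches = [s for s in somelist if pattern.match(s)]
print(matches)  # ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
```

Adding a third field is then just another tuple in the list, for example (0, "AAAA").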

Are you sure you need complex regex? You could also use:

[x for x in somelist if x[6:10] == '1234' and x[13:15] == 'SD']
# ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
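If more fields are added (A and B and C and ...), the slicing comparison can be generalized without regex. The sketch below assumes the double-spaced strings from the question, so '1234' starts at index 6 and 'SD' at index 13; matches_all and the specs list are hypothetical names. It relies on the optional start argument of str.startswith.

```python
def matches_all(line, specs):
    """Return True if every literal sits at its 0-based start offset."""
    return all(line.startswith(literal, start) for start, literal in specs)

somelist = [
    'AAAA  1234   SD OXD',
    'AAAB  2342   DF BDD',
    'ERTE  3454   RE DFD',
    'GWED  1234   SD TCD',
    'AAAA  2353   SD MKX',
    'VERD  1234   IO ERT',
]

two_fields = [(6, "1234"), (13, "SD")]
result = [s for s in somelist if matches_all(s, two_fields)]
print(result)  # ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']

# A third condition is one more tuple, not one more nested filter().
three_fields = [(0, "AAAA"), (6, "1234"), (13, "SD")]
result3 = [s for s in somelist if matches_all(s, three_fields)]
print(result3)  # ['AAAA  1234   SD OXD']
```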

I'm also not sure RegEx is the best solution, but this works if you do want that:

>>> import re
>>> regex = re.compile(r'^.{6}1234   SD.*$', re.MULTILINE)
>>> regex.findall("\n".join(somelist))
['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
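Since the question mentions millions of lines read from a file, joining everything into one string may not be desirable. One possible alternative, sketched here under assumptions, is to test each line lazily with a precompiled pattern. matching_lines is a hypothetical helper, and io.StringIO stands in for a real file object opened with open().

```python
import io
import re

# Compiled once; pattern.match anchors at the start of each line.
pattern = re.compile(r".{6}1234.{3}SD")

def matching_lines(lines, pattern):
    """Yield lines that match, one at a time, without building a big list."""
    for line in lines:
        line = line.rstrip("\n")
        if pattern.match(line):
            yield line

# Stand-in for a file; a real file object iterates over lines the same way.
fake_file = io.StringIO(
    "AAAA  1234   SD OXD\n"
    "AAAB  2342   DF BDD\n"
    "GWED  1234   SD TCD\n"
)

found = list(matching_lines(fake_file, pattern))
print(found)  # ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
```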

Why use two regexes? It can actually be done with one:

import re

somelist = [ 
     'AAAA  1234   SD OXD',
     'AAAB  2342   DF BDD',
     'ERTE  3454   RE DFD',
     'GWED  1234   SD TCD',
     'AAAA  2353   SD MKX',
     'VERD  1234   IO ERT',
     'AAAA 2353   SD MKX',
     'AAAA  2353  SD MKX']

print(list(filter(lambda x : re.search(r".{6}1234\s{3}SD",x) ,somelist)))
# ['AAAA  1234   SD OXD', 'GWED  1234   SD TCD']
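One caveat worth checking on your own data: re.search accepts the pattern at any offset, so a line whose fields are shifted to the right can still pass. The short comparison below uses a made-up shifted line (not from the question) to show the difference between search and match.

```python
import re

pattern = re.compile(r".{6}1234\s{3}SD")

aligned = "AAAA  1234   SD OXD"
shifted = "XXAAAA  1234   SD OXD"  # hypothetical line, fields pushed right by two

# search scans forward and finds the pattern starting at index 2.
print(bool(pattern.search(aligned)))  # True
print(bool(pattern.search(shifted)))  # True

# match only tries index 0, so the offsets behave as fixed positions.
print(bool(pattern.match(aligned)))   # True
print(bool(pattern.match(shifted)))   # False
```

If the positions must be exact, pattern.match (or a leading ^) is the safer choice.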


 