Error in regular expression to find the text between parentheses

I have a string:

string  ='((clearance) AND (embedded) AND (software engineer OR developer)) AND (embedded)'

I want to split it into a list based on the parentheses. Following solutions I found elsewhere, I used:

my_data = re.findall(r"(\(.*?\))",string)

But when I print my_data, the output has 4 elements:

['((clearance)', '(embedded)', '(software engineer OR developer)', '(embedded)']

My desired output has 2 elements:

['(clearance) AND (embedded) AND (software engineer OR developer)', '(embedded)']

That is because "(clearance) AND (embedded) AND (software engineer OR developer)" sits inside one pair of parentheses and "embedded" sits inside another. Why does re.findall split the string into 4 pieces instead?

How should I modify the regular expression to get my desired output?

Python's re module cannot match arbitrarily nested parentheses, so a pure regex solution is not possible in general. Here is an idea that counts parentheses instead:

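def find_stuff(string):
    indices = []
    counter = 0  # current nesting depth
    change = {"(": 1, ")": -1}
    for i, el in enumerate(string):
        # characters other than parentheses leave the depth unchanged
        new_count = counter + change.get(el, 0)
        if counter == 0 and new_count == 1:
            # depth 0 -> 1: an outermost "(" opens here
            indices.append(i)
        elif counter == 1 and new_count == 0:
            # depth 1 -> 0: the matching outermost ")" closes here;
            # store i + 1 so the index works as a slice end
            indices.append(i + 1)
        counter = new_count
    return indices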

This is not very pretty, but I think the concept is clear. It returns the start and end indices of each outermost pair of parentheses, so you can slice your string with them.
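As a rough sketch of that slicing step, the variant below tracks the nesting depth in the same way but collects the slices directly instead of returning indices. The name outer_groups is made up for this illustration, and the sketch assumes the parentheses in the input are balanced.

```python
def outer_groups(text):
    """Return each top-level parenthesised group, outer parentheses included.

    Hypothetical helper for illustration; assumes balanced parentheses.
    """
    groups = []
    depth = 0      # current nesting depth
    start = None   # index of the currently open outermost "("
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # an outermost group opens here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and start is not None:
                groups.append(text[start:i + 1])  # slice includes the ")"
                start = None
    return groups


string = '((clearance) AND (embedded) AND (software engineer OR developer)) AND (embedded)'
groups = outer_groups(string)
```

For the string from the question this gives two slices, each still wrapped in its outermost parentheses. If the outer pair should be dropped, as in the first item of the desired output, groups[0][1:-1] removes it.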

A bit of an re hack, but this is possible:

>>> import re
>>> string  ='((clearance) AND (embedded) AND (software engineer OR developer)) AND (embedded)'
>>> [e for e in re.split(r'\((?=\()(.*?)(?<=\))\)|(?<!\()(\([^()]+\))(?!\))',string) if e and '(' in e and ')' in e]
['(clearance) AND (embedded) AND (software engineer OR developer)', '(embedded)']
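To make that one-liner easier to follow, here is the same pattern written with re.VERBOSE and a comment on each part. This is only a readability sketch: the names PATTERN and split_groups are invented here. Note that the first alternative only recognises an outer group that starts with "((" and ends with "))", so the pattern is tuned to inputs shaped like this one rather than to arbitrary nesting.

```python
import re

PATTERN = re.compile(r"""
    \((?=\()      # an opening parenthesis directly followed by another one
    (.*?)         # group 1: lazily capture what is inside the outer pair
    (?<=\))\)     # a closing parenthesis directly preceded by another one
    |
    (?<!\()       # not directly preceded by an opening parenthesis
    (\([^()]+\))  # group 2: a simple pair with nothing nested inside
    (?!\))        # not directly followed by a closing parenthesis
""", re.VERBOSE)


def split_groups(text):
    # re.split also returns the captured groups; unmatched groups show up
    # as None and the text between matches (such as ' AND ') as plain strings.
    parts = PATTERN.split(text)
    # Keep only the pieces that actually contain parentheses.
    return [p for p in parts if p and "(" in p and ")" in p]


string = '((clearance) AND (embedded) AND (software engineer OR developer)) AND (embedded)'
result = split_groups(string)
```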
