
regex to extract nested patterns in parentheses and square brackets

I have a string of the following form:

(LEFT-WALL)(who)(is.v)(Obama)(,)(I.p)(love.v)(his)(speech.s)(RIGHT-WALL)

I split it to get each parenthesized item in a list. My regex works fine for flat input, but not for nested text such as (Ob(am)a).

Example:

import re

# Split on each flat "(...)" group; the capturing group keeps the contents in the result.
post_script_word_str = '(LEFT-WALL)(who)(is.v)(Obama)(,)(I.p)(love.v)(his)(speech.s)(RIGHT-WALL)'
post_script_word_list = re.compile(r'\(([^\)\(]*)\)').split(post_script_word_str)
print(post_script_word_list)

# Same idea for "[...]" groups.
post_script_link_str = '[0 12 4 (RW)][0 7 3 (Xx)][0 1 0 (Wd)][1 2 0 (Ss)][2 6 2 (Ost)][3 6 1 (Ds)][3 4 0 (La)][5 6 0 (AN)][7 8 0 (Wq)][8 9 0 (EAh)][9 10 0 (AF)][10 11 0 (SIs)]'
post_script_link_list = re.compile(r'\[([^\]\[]*)\]').split(post_script_link_str)
print(post_script_link_list)

Result:

['', 'LEFT-WALL', '', 'who', '', 'is.v', '', 'Obama', '', ',', '', 'I.p', '', 'love.v', '', 'his', '', 'speech.s', '', 'RIGHT-WALL', '']

['', '0 12 4 (RW)', '', '0 7 3 (Xx)', '', '0 1 0 (Wd)', '', '1 2 0 (Ss)', '', '2 6 2 (Ost)', '', '3 6 1 (Ds)', '', '3 4 0 (La)', '', '5 6 0 (AN)', '', '7 8 0 (Wq)', '', '8 9 0 (EAh)', '', '9 10 0 (AF)', '', '10 11 0 (SIs)', '']

But for input like (Ob(am)a) or [0 [1]2 4 (RW)] it fails. I expect the same result as above, but instead it gives:

['', 'LEFT-WALL', '', 'who', '', 'is.v', '(Ob', 'am', 'a)', ',', '', 'I.p', '', 'love.v', '', 'his', '', 'speech.s', '', 'RIGHT-WALL', '']

['[0 ', '1', '2 4 (RW)]', '0 7 3 (Xx)', '', '0 1 0 (Wd)', '', '1 2 0 (Ss)', '', '2 6 2 (Ost)', '', '3 6 1 (Ds)', '', '3 4 0 (La)', '', '5 6 0 (AN)', '', '7 8 0 (Wq)', '', '8 9 0 (EAh)', '', '9 10 0 (AF)', '', '10 11 0 (SIs)', '']

Any suggestions?

Updated input:

post_script_link_str = '[0 [1]2 4 (RW)][0 7 3 (Xx)][0 1 0 (Wd)][1 2 0 (Ss)][2 6 2 (Ost)][3 6 1 (Ds)][3 4 0 (La)][5 6 0 (AN)][7 8 0 (Wq)][8 9 0 (EAh)][9 10 0 (AF)][10 11 0 (SIs)]'

Result:

['[0 ', '1', '2 4 (RW)]', '0 7 3 (Xx)', '', '0 1 0 (Wd)', '', '1 2 0 (Ss)', '', '2 6 2 (Ost)', '', '3 6 1 (Ds)', '', '3 4 0 (La)', '', '5 6 0 (AN)', '', '7 8 0 (Wq)', '', '8 9 0 (EAh)', '', '9 10 0 (AF)', '', '10 11 0 (SIs)', '']

The re module cannot handle nested structures. You need the third-party regex module, which supports recursion. As an aside, I think the findall method is more appropriate for this job than split:

import regex  # third-party module, not part of the standard library

regex.findall(r'\[([^][]*+(?:(?R)[^][]*)*+)]', post_script_link_str)

You obtain:

['0 [1]2 4 (RW)', '0 7 3 (Xx)', '0 1 0 (Wd)', '1 2 0 (Ss)', '2 6 2 (Ost)', '3 6 1 (Ds)', '3 4 0 (La)', '5 6 0 (AN)', '7 8 0 (Wq)', '8 9 0 (EAh)', '9 10 0 (AF)', '10 11 0 (SIs)']

If you want the same items as in the non-nested case, all you need now is to map over the list and remove the remaining inner square brackets from each item.
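A minimal sketch of that last step, assuming the goal is to turn '0 [1]2 4 (RW)' back into '0 12 4 (RW)'. The names strip_brackets and matches are only illustrative, and the list literal simply repeats the first two items of the output above so the snippet runs on its own:

```python
def strip_brackets(item):
    # Drop every remaining '[' and ']' from a single extracted item.
    return item.replace('[', '').replace(']', '')

# First two items of the findall output shown above.
matches = ['0 [1]2 4 (RW)', '0 7 3 (Xx)']

cleaned = [strip_brackets(m) for m in matches]
print(cleaned)  # ['0 12 4 (RW)', '0 7 3 (Xx)']
```

Parentheses inside an item, such as (RW), are left untouched because only the two bracket characters are replaced.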

Pattern details:

(?R) enables the recursion, since it is an alias for the whole pattern.

*+ is a possessive quantifier. It behaves like * but does not allow the regex engine to backtrack. It is used here to prevent catastrophic backtracking if the brackets happen to be unbalanced.
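If installing a third-party module is not an option, the same top-level extraction can be sketched with a plain depth counter instead of a recursive pattern. This is a different technique from the regex above, not an equivalent of it, and split_top_level is a made-up name for illustration. It assumes that text outside any bracket can be ignored and that unbalanced input should raise an error:

```python
def split_top_level(text, open_ch='[', close_ch=']'):
    """Return the contents of each top-level open_ch...close_ch group."""
    items = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i + 1  # content begins just after the outer opener
            depth += 1
        elif ch == close_ch:
            if depth == 0:
                raise ValueError('unbalanced %r at index %d' % (close_ch, i))
            depth -= 1
            if depth == 0:
                items.append(text[start:i])  # outer group just closed
    if depth != 0:
        raise ValueError('missing %r' % close_ch)
    return items

links = split_top_level('[0 [1]2 4 (RW)][0 7 3 (Xx)]')
words = split_top_level('(who)(Ob(am)a)(,)', '(', ')')
print(links)  # ['0 [1]2 4 (RW)', '0 7 3 (Xx)']
print(words)  # ['who', 'Ob(am)a', ',']
```

The counter only tracks one bracket type per call, so the parentheses inside '[0 7 3 (Xx)]' are treated as ordinary characters.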
