
bash_rematch and regex (with nested parens)

I'm having trouble with a regex. I need to search a string for a pattern and, when it is found, cut the matching part out of the string. I wrote a regex like this:

regex='(.*)((aa[[:space:]]bb)|(awd)|(bab)|(bc[[:space:]]d))(.*)'

in which I define everything before the target (1), the part that can be the target (2), and everything after it (3). It's easy with a simple regex like (.*)(abc)(.*):

string="abc"; regex='(.*)(abc)(.*)'

[[ $string =~ $regex ]] && myvar=${BASH_REMATCH[2]} && buffer=${BASH_REMATCH[1]}${BASH_REMATCH[3]}

The trouble begins when I define a regex with nested parens and OR groups, like the first regex posted here. This is a sample from my shell:

$ string=" foo bar baz bac"
$ regex='(.*)((hello[[:space:]]world)|(example)|(funk[[:space:]]you)|(bar[[:space:]]baz))(.*)'

$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[1]}
foo
$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[2]}
bar baz
$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[3]}

$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[4]}

$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[5]}

$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[6]}
bar baz
$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[7]}
bac
$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[@]}
foo bar baz bac foo bar baz bar baz bac

The matching behaves strangely: I don't find the rest of the input string in ${BASH_REMATCH[3]}, although it is matched by the 3rd parenthesized group of the regex. What happens with nested parens?

bash assigns numbers to the capture groups based on a left-to-right ordering of the opening parentheses. Basically, it's a depth-first ordering, not breadth-first like you are assuming.

1. (.*)
2. (
3.   (hello[[:space:]]world)|
4.   (example)|
5.   (funk[[:space:]]you)|
6.   (bar[[:space:]]baz)
   )
7. (.*)

In this regular expression, group 2 is essentially a copy of whichever of groups 3, 4, 5 or 6 actually matches, since group 2 contains nothing else. Group 7 is what you think of as the 3rd parenthesis group.
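Since group 2 only wraps the alternatives, one way to get the 1/2/3 numbering the question expects is to drop the inner parentheses. The following is a minimal sketch, assuming the individual alternatives do not need to be captured separately; the names myvar and buffer are taken from the question:

```shell
#!/usr/bin/env bash
string=" foo bar baz bac"

# Same alternatives as before, but without the inner parentheses.
# Only three opening parentheses remain, so the alternation is group 2
# and the trailing (.*) is group 3.
regex='(.*)(hello[[:space:]]world|example|funk[[:space:]]you|bar[[:space:]]baz)(.*)'

if [[ $string =~ $regex ]]; then
    myvar=${BASH_REMATCH[2]}                      # the matched target
    buffer=${BASH_REMATCH[1]}${BASH_REMATCH[3]}   # the string with the target cut out
    printf 'myvar=[%s]\n' "$myvar"
    printf 'buffer=[%s]\n' "$buffer"
fi
```

If the nested groups are needed after all, the original regex can stay as it is and ${BASH_REMATCH[7]} can be used in place of ${BASH_REMATCH[3]}.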

Group 0 is the entire match, which explains your last line using @ :

$ [[ $string =~ $regex ]] && echo ${BASH_REMATCH[@]}
foo bar baz bac foo bar baz bar baz bac
|             | | | |     | |     | | |
+-------------+ +-+ +-----+ +-----+ +-+
       0         1     2       6     7

(The empty groups 3, 4, and 5 are swallowed up as whitespace during word-splitting.)
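To see every group, including the empty ones, it can help to print each element quoted and with its index instead of relying on an unquoted echo. A small sketch using the string and regex from the question:

```shell
#!/usr/bin/env bash
string=" foo bar baz bac"
regex='(.*)((hello[[:space:]]world)|(example)|(funk[[:space:]]you)|(bar[[:space:]]baz))(.*)'

if [[ $string =~ $regex ]]; then
    # "${!BASH_REMATCH[@]}" expands to the array indices, so each group is
    # listed next to its number. The brackets make empty values and
    # leading or trailing spaces visible.
    for i in "${!BASH_REMATCH[@]}"; do
        printf '%d: [%s]\n' "$i" "${BASH_REMATCH[i]}"
    done
fi
```

Quoting the expansion keeps the shell from word-splitting the values, so groups that matched nothing show up as empty brackets rather than disappearing.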


 