Matching balanced parentheses in Perl regex

I have an expression which I need to split and store in an array:

aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"

It should look like this once split and stored in the array:

aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"

I am using Perl 5.8. How can I do this?

Use the Perl module Regexp::Common. It provides a balanced-delimiter regex that works well.

use Regexp::Common;

# $l holds the input string; capture every balanced {...} group in it.
my $bp    = $RE{balanced}{-parens => '{}'};
my @genes = $l =~ /($bp)/g;

There's an example in perlre that uses the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)

$re = qr{ 
            (                                # paren group 1 (full function)
                foo
                (                            # paren group 2 (parens)
                    \(
                        (                    # paren group 3 (contents of parens)
                            (?:
                                (?> [^()]+ ) # Non-parens without backtracking
                                |
                                (?2)         # Recurse to start of paren group 2
                            )*
                        )
                    \)
                )
            )
    }x;
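
As a rough illustration of what the three groups capture, here is the same pattern applied to a sample string. This is only a sketch: it needs Perl 5.10 or later for the (?2) recursion, and the variable name @groups is just chosen for the example.

```perl
use strict;
use warnings;

# Same pattern as above; (?2) requires Perl 5.10+.
my $re = qr{
    (                              # group 1: foo plus its argument list
        foo
        (                          # group 2: one balanced parenthesized block
            \(
            (                      # group 3: contents of the parentheses
                (?:
                    (?> [^()]+ )   # run of non-parens, no backtracking
                    |
                    (?2)           # recurse into group 2 for a nested block
                )*
            )
            \)
        )
    )
}x;

# In list context the match returns ($1, $2, $3).
my @groups = 'foo(bar(baz)+baz(bop))' =~ $re;
print "$_\n" for @groups;
# foo(bar(baz)+baz(bop))
# (bar(baz)+baz(bop))
# bar(baz)+baz(bop)
```

After a recursion returns, the capture groups keep the values set at the outermost level, which is why group 3 holds the full contents rather than the innermost "bop".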

The solutions proposed above do not work if you need to match balanced parentheses or curly brackets while also taking backslash-escaped ones into account. Instead, you would write something like this (building on the solution suggested in perlre):

$re = qr/
(                                                # paren group 1 (full function)
    foo
    (?<paren_group>                              # paren group 2 (parens)
        \(
            (                                    # paren group 3 (contents of parens)
                (?:
                    (?> (?:\\[()]|(?![()]).)+ )  # escaped parens or no parens
                    |
                    (?&paren_group)              # Recurse to named capture group
                )*
            )
        \)
    )
)
/x;
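
A small check of how this pattern treats an escaped parenthesis might look like the following. It is a sketch only: the sample string and the variables $text, $whole and $parens are made up for the illustration, and Perl 5.10 or later is assumed for named captures and recursion.

```perl
use strict;
use warnings;

# Same pattern as above (Perl 5.10+).
my $re = qr/
(
    foo
    (?<paren_group>
        \(
            (
                (?:
                    (?> (?:\\[()]|(?![()]).)+ )  # escaped parens or no parens
                    |
                    (?&paren_group)              # recurse to named capture group
                )*
            )
        \)
    )
)
/x;

# The \) after "a" is escaped, so it must not close the group.
my $text = 'foo(a\)b(c)) tail';
my ($whole, $parens);
if ($text =~ $re) {
    $whole  = $1;
    $parens = $+{paren_group};
    print "$whole\n";     # foo(a\)b(c))
    print "$parens\n";    # (a\)b(c))
}
```

With the plain [^()]+ version, the match on the same string would stop at the escaped parenthesis and return foo(a\) instead.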

I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
         'aaa="bbb{}" { aa="b}b" }, ' .
         'aaa="bbb,ccc"'
;

my @out = ('');

my $nesting = 0;
while($in !~ m/\G$/cg)
{
  if($nesting == 0 && $in =~ m/\G,\s*/cg)
  {
    push @out, '';
    next;
  }
  if($in =~ m/\G(\{+)/cg)
    { $nesting += length $1; }
  elsif($in =~ m/\G(\}+)/cg)
  {
    $nesting -= length $1;
    die if $nesting < 0;
  }
  elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
    { }
  else
    { die; }
  $out[-1] .= $1;
}

(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but as far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the die calls with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \" ? Can ' be used instead of " ? This code doesn't handle either of those possibilities.)

Try something like this:

use strict;
use warnings;
use Data::Dumper;

my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }     , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END

chomp $exp;
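# Split on a comma that directly follows a closing brace. The lookbehind keeps
# the brace in the element, so nothing has to be re-appended afterwards.
my @arr = split /(?<=\})\s*,\s*/, $exp;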
print Dumper(\@arr);

A split solution seems simplest. Split on a lookahead for your main variable name, aaa, surrounded by word boundaries. The optional character class [,\s]* in front of the lookahead consumes the comma and whitespace that separate the elements.

my $string = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"';
my @array = split /[,\s]*(?=\baaa\b)/, $string;

Although recursive regular expressions can usually be used to capture balanced braces {}, they won't work for you, because you also need to match balanced quotes ".
This would be a very tricky task for a Perl regular expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" regex feature.)

I would suggest creating your own parser. As you process each character, you keep track of each " and {}, and only split on , when they are balanced.
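
The character-by-character approach described here could be sketched roughly as follows. The helper name split_top_level is hypothetical, and the sketch assumes that quotes are never escaped and that only " delimits strings; it uses nothing newer than Perl 5.8.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text on commas that lie outside
# both "..." strings and {...} blocks.
sub split_top_level {
    my ($text) = @_;
    my @parts    = ('');
    my $depth    = 0;   # current {} nesting level
    my $in_quote = 0;   # true while inside a "..." string

    for my $ch (split //, $text) {
        if ($ch eq '"') {
            $in_quote = !$in_quote;
        }
        elsif (!$in_quote) {
            if    ($ch eq '{') { $depth++ }
            elsif ($ch eq '}') { $depth-- }
            elsif ($ch eq ',' && $depth == 0) {
                push @parts, '';    # balanced here: start a new element
                next;
            }
        }
        $parts[-1] .= $ch;
    }

    # Trim surrounding whitespace from each element.
    for (@parts) { s/^\s+//; s/\s+$//; }
    return @parts;
}

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, '
       . 'aaa="bbb{}" { aa="b}b" }, '
       . 'aaa="bbb,ccc"';
my @out = split_top_level($in);
print "$_\n" for @out;
```

For the input from the question this prints the three expected elements, one per line. Error handling for unbalanced braces or an unterminated quote is left out of the sketch.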
