
Matching balanced parentheses in Perl regex

I have an expression which I need to split and store in an array:

aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"

It should look like this once split and stored in the array:

aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"

I am using Perl 5.8. Could someone help me solve this?

Use the Perl module Regexp::Common. It has a nice balanced-parentheses regex that works well.

# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
@genes = $l =~ /($bp)/g;

There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)

$re = qr{ 
            (                                # paren group 1 (full function)
                foo
                (                            # paren group 2 (parens)
                    \(
                        (                    # paren group 3 (contents of parens)
                            (?:
                                (?> [^()]+ ) # Non-parens without backtracking
                                |
                                (?2)         # Recurse to start of paren group 2
                            )*
                        )
                    \)
                )
            )
    }x;
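For readers on Perl 5.10 or later, the same recursion idea can be adapted to the data in the question. This is only a sketch under some assumptions: items are separated by commas outside any braces or quotes, quoted strings never contain escaped quotes, and the group names block and item are made up for this illustration.

```perl
use strict;
use warnings;
use 5.010;    # named groups and (?&name) recursion need Perl 5.10 or later

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, '
       . 'aaa="bbb{}" { aa="b}b" }, '
       . 'aaa="bbb,ccc"';

# One item: a run of quoted strings, balanced brace blocks, and other
# characters that are not commas, braces, or quotes.
my $item = qr/
    (?:
        " [^"]* "                 # a double-quoted string (no escapes handled)
      |
        (?<block> \{              # a balanced brace block
            (?:
                [^{}"]++          # plain text inside the block
              |
                " [^"]* "         # quoted string; braces inside are ignored
              |
                (?&block)         # recurse into a nested block
            )*
        \} )
      |
        [^,{}"]++                 # plain text outside any block
    )+
/x;

my @out;
# Walk the string item by item; each item must be followed by a comma or the end.
while ($in =~ /\G \s* (?<item> $item ) \s* (?: , | \z ) /xg) {
    push @out, $+{item};
}

print "$_\n" for @out;
```

The recursion is by name rather than by number, so the pattern keeps working when it is interpolated into a larger regex that has capture groups of its own.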

If you want to match balanced parentheses or curly brackets and also take backslash-escaped ones into account, the proposed solutions will not work. Instead, you would write something like this (building on the suggested solution in perlre):

$re = qr/
(                                                # paren group 1 (full function)
    foo
    (?<paren_group>                              # paren group 2 (parens)
        \(
            (                                    # paren group 3 (contents of parens)
                (?:
                    (?> (?:\\[()]|(?![()]).)+ )  # escaped parens or no parens
                    |
                    (?&paren_group)              # Recurse to named capture group
                )*
            )
        \)
    )
)
/x;
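As a rough usage sketch, the pattern can be tried on a string that contains both a nested call and an escaped parenthesis. The input string and the variable names $text and $call are invented for this example, and the pattern is repeated (without the extra capture groups) so the snippet runs on its own.

```perl
use strict;
use warnings;
use 5.010;    # named groups and (?&name) recursion need Perl 5.10 or later

my $re = qr/
    foo
    (?<paren_group>
        \(
            (?:
                (?> (?: \\[()] | (?![()]) . )+ )   # escaped parens or non-parens
                |
                (?&paren_group)                    # nested parenthesised group
            )*
        \)
    )
/x;

# Hypothetical input: one nested call and one escaped ")" inside it.
my $text = 'x = foo(a, bar(b \) c), d) + 1';

# In list context the first returned capture is the whole foo(...) match.
my ($call) = $text =~ /($re)/;

print "$call\n";
```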

I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:

my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
         'aaa="bbb{}" { aa="b}b" }, ' .
         'aaa="bbb,ccc"'
;

my @out = ('');

my $nesting = 0;
while($in !~ m/\G$/cg)
{
  if($nesting == 0 && $in =~ m/\G,\s*/cg)
  {
    push @out, '';
    next;
  }
  if($in =~ m/\G(\{+)/cg)
    { $nesting += length $1; }
  elsif($in =~ m/\G(\}+)/cg)
  {
    $nesting -= length $1;
    die if $nesting < 0;
  }
  elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
    { }
  else
    { die; }
  $out[-1] .= $1;
}

(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the die calls with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \" ? Can ' be used instead of " ? This code doesn't handle either of those possibilities.)

Try something like this:

use strict;
use warnings;
use Data::Dumper;

my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }     , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END

chomp $exp;
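# Split on each comma that directly follows a closing brace. The lookbehind
# keeps the brace, so the last item (which has none) is left unchanged.
my @arr = map { s/^\s+//; s/\s+$//; $_ } split /(?<=\})\s*,\s*/, $exp;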
print Dumper(\@arr);

A split solution seems simplest. Split on a lookahead for your main variable aaa, with word boundaries around it. Strip the preceding whitespace and comma with an optional character class.

$string = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"';
my @array = split /[,\s]*(?=\baaa\b)/, $string;

Although recursive regular expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl regular expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" regex feature.)

I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".
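The character-by-character idea could be sketched roughly as follows. It uses nothing newer than Perl 5.8 features. The helper name split_top_level is hypothetical, and the sketch assumes that quotes are always double quotes and are never escaped.

```perl
use strict;
use warnings;

# Split $text on commas that lie outside both double quotes and braces.
# split_top_level is a hypothetical helper name, not from the original answer.
sub split_top_level {
    my ($text) = @_;
    my @parts    = ('');
    my $depth    = 0;    # current {} nesting depth
    my $in_quote = 0;    # true while inside "..."

    for my $ch (split //, $text) {
        if ($ch eq '"') {
            $in_quote = !$in_quote;
        }
        elsif (!$in_quote) {
            if ($ch eq '{') {
                $depth++;
            }
            elsif ($ch eq '}') {
                $depth--;
                die "unbalanced closing brace\n" if $depth < 0;
            }
            elsif ($ch eq ',' && $depth == 0) {
                push @parts, '';    # comma outside braces and quotes: new part
                next;
            }
        }
        $parts[-1] .= $ch;
    }
    die "unbalanced input\n" if $depth != 0 || $in_quote;

    # Trim surrounding whitespace from each part.
    for (@parts) { s/^\s+//; s/\s+$//; }
    return @parts;
}

my @out = split_top_level(
    'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, '
  . 'aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"'
);
print "$_\n" for @out;
```

Braces are only counted while outside a quoted string, which is what keeps the } in "b}b" from closing a block early.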
