Perl regex for multiline strings

Question

I am trying to find all the strings in a file (the text between " or ') by reading the file line by line.

my @strings = ();
open FILE, $file or die "File operation failed: $!";
foreach my $line (<FILE>) {
    push(@strings, $1) if $line =~ /(['"].*['"])/g;
}
close FILE;

The problem is that this code only works for strings that sit on a single line.

print "single line string";

But I also have to match multiline strings, for example:

print "This is a
multiline
string";

What can I do?

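By the way, I know my regex is not good enough. It should match a string that begins with " and ends with " (and likewise for single quotes), but it should not match something like "not correct string'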

Update: my new code is

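use Text::ParseWords;    # provides quotewords()

my @strings = ();
open FILE, $file or die "File operation failed: $!";
local $/;    # slurp mode: the loop below sees the whole file as one "line"
foreach my $line (<FILE>) {
    push(@strings, grep { defined and /["']/ } quotewords('\s+', 1, $line));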
}
close FILE;

But if the data is:

print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

I should get:

"single line \n"
"This is a
multiline
string"
'single quote string'
"string with variable "
" after variable"

Answer 1

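Below are two regular expressions for parsing single-quoted and double-quoted strings. Note that I slurp all of the data at once so that multiline strings can be captured: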

use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

my $data = do {local $/; <DATA>};

while ($data =~ /($squo_re|$dquo_re)/g) {
    print "<$1>\n";
}

__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

However, since you are trying to parse Perl code, the cleanest approach is to use PPI:

use strict;
use warnings;

use PPI;

my $src = do {local $/; <DATA>};

# Load a document
my $doc = PPI::Document->new( \$src );

# Find all the quoted-string tokens within the doc
# (find() returns a false value when nothing matches, hence the "|| []")
my $strings = $doc->find( 'PPI::Token::Quote' ) || [];
for (@$strings) {
    print '<', $_->content, ">\n";
}

__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

Output from both methods:

<"single line \n">
<"This is a
multiline
string">
<'single quote string'>
<"string with variable ">
<" after variable">
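To collect the matches into an array, as the question's code does, the regex approach can be wrapped in a small helper. This is only a minimal sketch: the sub name `extract_strings` and the sample text are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical helper: return every quoted string found in $text,
# including strings that span several lines.
sub extract_strings {
    my ($text) = @_;
    my @strings;
    while ($text =~ /($squo_re|$dquo_re)/g) {
        push @strings, $1;
    }
    return @strings;
}

# Illustrative input: a multiline string, a single-quoted string,
# and a string containing escaped double quotes.
my $sample = qq{print "a\nb";\nprint 'c';\nprint "d \\" e";\n};
print "<$_>\n" for extract_strings($sample);
```

For a real file, the text would be slurped first, for example with `my $text = do { local $/; <$fh> };` on an already opened filehandle, and then passed to the helper.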

Update regarding (?> ... )

Below is an annotated version of the double-quote regex.

my $dquo_re = qr{
    "
        (?:                # Non-capturing group - http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
            (?>            # Independent Subexpression to prevent backtracking (this is for efficiency only) - http://perldoc.perl.org/perlretut.html#Using-independent-subexpressions-to-prevent-backtracking
                [^"\\]*    # All characters NOT a " or \
            )
        |
            \\.            # Backslash followed by any escaped character
        )*                 # Any number of the preceding or'd group
    "
    }x;

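The independent subexpression (?> ... ) is not actually necessary for this regex. It is there to prevent backtracking, because there is only one way for a quoted string to match: either we find the closing quote using the rules above, or we don't.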

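Independent subexpressions are far more useful when dealing with recursive regular expressions, but I have always used one in this situation as well. I will have to benchmark it later to determine whether it is actually just premature optimization.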
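One way to run such a benchmark is with the core Benchmark module. The sketch below is an assumption about how the comparison could be set up, not a measurement from the original answer: the `$plain_re` variant, the helper name `count_matches`, and the test input are all made up for illustration, and the input contains only well-formed strings, so it says nothing about behaviour on unterminated quotes.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The double-quote regex with and without the independent subexpression.
my $atomic_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $plain_re  = qr{"(?:[^"\\]*|\\.)*"};

# Assumed test input: one well-formed, multiline, escaped string per
# repetition, repeated to give the regex engine some work to do.
my $data = join '',
    ( qq{print "string with \\"escapes\\" and\nnewlines" . 'x';\n} ) x 200;

# Count all matches in list context so each run scans the whole input.
sub count_matches {
    my ($re) = @_;
    my $n = () = $data =~ /$re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    atomic => sub { count_matches($atomic_re) },
    plain  => sub { count_matches($plain_re) },
} );
```

The numbers will depend on the Perl version and on the input, so it is worth repeating the run with text that resembles the real files.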

Update regarding comments

To avoid picking up comments, you can just use the PPI solution that I already proposed. It is designed to parse Perl code and already works as is.

However, given that this is a lab assignment, a regex solution would be to add a second capture group to the loop to look for comments:

while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
    my $quote = $1;
    my $comment = $2;

    if (defined $quote) {
        print "<$quote>\n";
    } elsif (defined $comment) {
        print "Comment - $comment\n";
    }
}

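The above will match either a quoted string or a comment. Whichever capture actually matched will be defined, so you know which one was found. You will have to come up with the regex for finding comments yourself, though.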
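As a starting point, here is one possible, deliberately naive definition of `$comment_re`. It assumes that a comment is simply a `#` followed by everything up to the end of the line; that assumption is not from the original answer, and it breaks on constructs such as `$#array`, a `#` inside a regex, POD, or here-documents. The helper name `scan_tokens` and the sample source are also hypothetical.

```perl
use strict;
use warnings;

my $squo_re    = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re    = qr{"(?:(?>[^"\\]*)|\\.)*"};
# Assumption: a comment is a '#' and everything up to the end of the line.
my $comment_re = qr{#[^\n]*};

# Hypothetical helper: return [type, text] pairs in order of appearance.
sub scan_tokens {
    my ($data) = @_;
    my @found;
    while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
        my ($quote, $comment) = ($1, $2);
        if (defined $quote) {
            push @found, [ quote => $quote ];
        } elsif (defined $comment) {
            push @found, [ comment => $comment ];
        }
    }
    return @found;
}

my $src = <<'END';
print "has # inside";  # says "hello"
print 'done';
END

printf "%-7s %s\n", @$_ for scan_tokens($src);
```

Because the scan moves left to right, a `#` inside a string is consumed as part of that string, and a quoted string inside a comment is consumed as part of that comment, so neither is reported twice.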

Answer 1: score 3, accepted, 2014-04-15 15:29:22