Perl multiline string regex

I am trying to find all the strings (delimited by " or ') in a file by reading the file line by line.

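my @strings = ();
# Parenthesize open() so that "or die" applies to open, not to $file
open(my $fh, '<', $file) or die "File operation failed: $!";
while (my $line = <$fh>) {
    # Match against $line explicitly; the loop variable is not $_
    push(@strings, $1) if $line =~ /(['"].*['"])/;
}
close $fh;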

The problem is that this code works only for strings on a single line:

print "single line string";   

But I also have to match multiline strings like:

print "This is a
multiline
string";

How can I do this?

By the way, I know my regex isn't good enough: it should match strings that start with " and end with " (and likewise for single quotes), but not mismatched ones such as "not correct string'

Update: my new code is

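use Text::ParseWords;    # provides quotewords()

my @strings = ();
open(my $fh, '<', $file) or die "File operation failed: $!";
local $/;                # slurp mode: the loop below sees the whole file at once
foreach my $line (<$fh>) {
    push(@strings, grep { defined and /["']/ } quotewords('\s+', 1, $line));
}
close $fh;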

But if the data is:

print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

I should get:

"single line \n"
"This is a
multiline
string"
'single quote string'
"string with variable "
" after variable"

The following are two regexes for parsing either single- or double-quoted strings. Note that I've slurped all the data in order to be able to catch multiline strings:

use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

my $data = do {local $/; <DATA>};

while ($data =~ /($squo_re|$dquo_re)/g) {
    print "<$1>\n";
}

__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

However, because you're trying to parse Perl code, the cleanest way to do it is with PPI:

use strict;
use warnings;

use PPI;

my $src = do {local $/; <DATA>};

# Load a document
my $doc = PPI::Document->new( \$src );

# Find all the quoted-string tokens within the doc
my $strings = $doc->find( 'PPI::Token::Quote' ) || [];   # find() returns false when nothing matches
for (@$strings) {
    print '<', $_->content, ">\n";
}

__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

Both methods output:

<"single line \n">
<"This is a
multiline
string">
<'single quote string'>
<"string with variable ">
<" after variable">
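To apply the regex approach to a real file and collect the results in an array, as in the question, something like the following sketch might do. The helper names extract_strings and slurp_file are made up for this illustration and are not part of any module; the two patterns are the ones shown above.

```perl
use strict;
use warnings;

# Same patterns as above: a quoted string that may contain backslash escapes.
my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical helper: return every quoted string found in $text.
sub extract_strings {
    my ($text) = @_;
    # In list context, /g returns the contents of capture group 1 for each match.
    my @strings = $text =~ /($squo_re|$dquo_re)/g;
    return @strings;
}

# Hypothetical helper: read a whole file at once so multiline strings stay intact.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                 # no record separator: one read returns everything
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Usage (assumed script name): perl extract.pl some_script.pl
if (@ARGV) {
    print "<$_>\n" for extract_strings(slurp_file($ARGV[0]));
}
```

Keeping the matching in a function that takes a string makes it easy to test without touching the filesystem.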

Update about (?> ... )

The following is an annotated version of the double quote regular expression.

my $dquo_re = qr{
    "
        (?:                # Non-capturing group - http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
            (?>            # Independent Subexpression to prevent backtracking (this is for efficiency only) - http://perldoc.perl.org/perlretut.html#Using-independent-subexpressions-to-prevent-backtracking
                [^"\\]*    # All characters NOT a " or \
            )
        |
            \\.            # Backslash followed by any escaped character
        )*                 # Any number of the preceding or'd group
    "
    }x;

The independent subexpression (?> ... ) is not actually required for this regex to work. It is intended to prevent backtracking, because there is only one way for a quoted string to match: either we find an ending quote using the above rules or we don't.

The subexpression is a lot more useful when dealing with a recursive regex, but I've always used it in this case. I'll have to benchmark it at a later date to decide whether it's actually just a premature optimization.
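One rough way to check this is sketched below, using the core Benchmark module. The sample text and the names $atomic_re, $plain_re and $sample are invented for the illustration. The script first confirms that both variants find the same strings on well-formed input, then times them. Results will depend on the input and the Perl version, so treat the numbers as indicative only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $atomic_re = qr{"(?:(?>[^"\\]*)|\\.)*"};   # pattern from this answer
my $plain_re  = qr{"(?:[^"\\]*|\\.)*"};       # same pattern without (?> ... )

# Invented sample: 50 lines, each holding one properly terminated string.
my $sample = join '',
    map { qq{print "line $_ with \\"escapes\\" inside";\n} } 1 .. 50;

# Neither pattern has capture groups, so /g in list context
# returns the full text of every match.
my @atomic = $sample =~ /$atomic_re/g;
my @plain  = $sample =~ /$plain_re/g;
print "same matches: ", ("@atomic" eq "@plain" ? "yes" : "no"), "\n";

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    atomic => sub { my @m = $sample =~ /$atomic_re/g },
    plain  => sub { my @m = $sample =~ /$plain_re/g },
});
```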

Update about Comments

To ignore comments, you can just use the PPI solution that I already proposed. It's meant to parse Perl code and will work as is.

However, given this is a lab assignment, a regex solution would be to set up a second capturing group in your loop for finding comments:

while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
    my $quote   = $1;
    my $comment = $2;

    if (defined $quote) {
        print "<$quote>\n";
    } elsif (defined $comment) {
        print "Comment - $comment\n";
    }
}

The above will match either a quoted string or a comment. Only the capture group that actually matched will be defined, so you can tell which one was found. You will have to come up with the regular expression for finding a comment on your own though.
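To show how the alternation behaves, here is a self-contained sketch with a deliberately naive placeholder for $comment_re. The placeholder is an assumption made for this illustration, not a full answer: it treats any # outside a quoted string as the start of a comment, and ignores POD, heredocs, and # used as a delimiter. The sample input is also invented.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# ASSUMPTION: a comment runs from '#' to the end of the line.
# This is a naive stand-in; real Perl comment detection is harder.
my $comment_re = qr{#[^\n]*};

my $data = <<'END_SRC';
print "not # a comment";  # real comment with "a quote"
print 'second';
END_SRC

my (@quotes, @comments);
while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
    if (defined $1) {
        push @quotes, $1;      # a quoted string matched at this position
    } else {
        push @comments, $2;    # otherwise the comment branch matched
    }
}

print "<$_>\n" for @quotes;
print "Comment - $_\n" for @comments;
```

Because the scan moves left to right and each match consumes its text, a # inside a string is swallowed by the quote branch, and a quote inside a comment is swallowed by the comment branch.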


 