I am trying to find all the strings (delimited by " or ') in a file by reading the file line by line.
my @strings = ();
open FILE, $file or die "File operation failed: $!";
foreach my $line (<FILE>) {
push(@strings, $1) if /(['"].*['"])/g;
}
close FILE;
The problem is that this code works only for strings on a single line:
print "single line string";
But I also have to match multiline strings like:
print "This is a
multiline
string";
How can I do this?
By the way, I know my regex isn't good enough: it should match strings that start with " and end with " (and likewise for single quotes), but not mismatched ones such as "not correct string'
Update: my new code is:
my @strings = ();
open FILE, $file or die "File operation failed: $!";
local $/;
foreach my $line (<FILE>) {
push(@strings, grep { defined and /["']/ } quotewords('\s+', 1, $_));
}
close FILE;
But if the data is:
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
I should get:
"single line \n"
"This is a
multiline
string"
'single quote string'
"string with variable "
" after variable"
The following are two regexes for matching single-quoted and double-quoted strings. Note that I've slurped all the data in order to be able to catch multiline strings:
use strict;
use warnings;
my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $data = do {local $/; <DATA>};
while ($data =~ /($squo_re|$dquo_re)/g) {
print "<$1>\n";
}
__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
However, because you're trying to parse Perl code, the cleanest way of doing it is to use PPI:
use strict;
use warnings;
use PPI;
my $src = do {local $/; <DATA>};
# Load a document
my $doc = PPI::Document->new( \$src );
# Find all the quoted strings within the doc
# (find returns a false value when nothing matches, hence the fallback)
my $strings = $doc->find( 'PPI::Token::Quote' ) || [];
for (@$strings) {
print '<', $_->content, ">\n";
}
__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
Both methods output:
<"single line \n">
<"This is a
multiline
string">
<'single quote string'>
<"string with variable ">
<" after variable">
Update about (?> ... )
The following is an annotated version of the double quote regular expression.
my $dquo_re = qr{
"
(?: # Non-capturing group - http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
(?> # Independent Subexpression to prevent backtracking (this is for efficiency only) - http://perldoc.perl.org/perlretut.html#Using-independent-subexpressions-to-prevent-backtracking
[^"\\]* # All characters NOT a " or \
)
|
\\. # Backslash followed by any escaped character
)* # Any number of the preceding or'd group
"
}x;
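As a small illustration of the `\\.` branch in the annotated pattern, here is a minimal sketch that runs the two regexes over strings containing escaped quotes. The sample input and the `@found` array are assumptions made up for this example; they are not part of the original question.

```perl
use strict;
use warnings;

# Same patterns as in the answer above.
my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical sample input (single-quoted heredoc, so backslashes stay literal).
my $sample = <<'END';
print "she said \"hi\" to me";
print 'it\'s fine';
END

# Collect every quoted string, escaped quotes included.
my @found;
while ($sample =~ /($squo_re|$dquo_re)/g) {
    push @found, $1;
}
print "<$_>\n" for @found;
```

Each quoted string should come back whole, because a backslash followed by any character is consumed as a pair and so an escaped quote cannot close the string.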
The independent subexpression (?> ... ) is not actually required for this regex to work. It is intended to prevent backtracking, because there is only one way for a quoted string to match: either we find a closing quote using the above rules or we don't.
The subexpression is a lot more useful when dealing with a recursive regex, but I've always used it in this case. I'll have to benchmark at a later date to decide if it's actually just a premature optimization.
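To check the claim that the independent subexpression is optional for correctness, here is a minimal sketch that runs the pattern with and without (?> ... ) over the same input and compares the matches. The names `$atomic_re` and `$plain_re` are made up for this example, and the sketch only compares results; it does not measure speed.

```perl
use strict;
use warnings;

# The pattern from the answer, with the independent subexpression.
my $atomic_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
# The same pattern with the (?> ... ) wrapper removed (illustrative variant).
my $plain_re  = qr{"(?:[^"\\]*|\\.)*"};

# Sample data based on the question (single-quoted heredoc keeps it literal).
my $sample = <<'END';
print $time . "single line \n";
print "This is a
multiline
string";
print "string with variable ".$time." after variable";
END

# In list context, /g returns every captured match.
my @atomic = $sample =~ /($atomic_re)/g;
my @plain  = $sample =~ /($plain_re)/g;

print scalar(@atomic), " matches with (?>...), ", scalar(@plain), " without\n";
print "same results\n" if join("\0", @atomic) eq join("\0", @plain);
```

On well-formed input like this, both variants should find the same strings. Any difference would be expected to show up in how much backtracking happens on input with an unterminated quote, which this sketch does not exercise.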
Update about Comments
To skip comments, you can just use the PPI solution that I already proposed. It's meant to parse Perl code and will already work as it is.
However, given this is a lab assignment, a regex solution would be to set up a second capturing group in your loop for finding comments:
while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
my $quote = $1;
my $comment = $2;
if (defined $quote) {
print "<$quote>\n";
} elsif (defined $comment) {
print "Comment - $comment\n";
}
}
The above will match either a quoted string or a comment. Whichever capture group actually matched will be defined, so you can tell which one was found. You will have to come up with the regular expression for finding a comment on your own though.
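The "check which capture is defined" technique can be tried without a comment regex. The sketch below puts the two quote patterns in separate capture groups and reports which one matched. The sample input and the `@kinds` array are assumptions for illustration only, and the comment pattern is deliberately left out.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical sample input.
my $sample = <<'END';
print 'single quote string';
print "double quote string";
END

my @kinds;
while ($sample =~ /($squo_re)|($dquo_re)/g) {
    # Only the group belonging to the alternative that matched is defined.
    my ($single, $double) = ($1, $2);
    if (defined $single) {
        push @kinds, "single: $single";
    } elsif (defined $double) {
        push @kinds, "double: $double";
    }
}
print "$_\n" for @kinds;
```

The same shape carries over to the string-or-comment loop: one capture group per kind of token, then a `defined` test on each.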