How to iterate over a multiline string with perl's regex

Question

I need to extract several sections from a multiline string with Perl. I'm applying the same regex in a while loop. My problem is matching the last section, which ends at the end of the file. My workaround is to append the marker so that the regex always finds an end. Is there a better way to do it?

Example file:

Header

==== /home/src/file1.c#1 ====
content file1
line 1 of file1
line 2 of file1
line 3 of file1

another line of file1

==== /home/src/file2.c#1 ====
content file2
line 1 of file2
line 2 of file2
line 3 of file2

another line of file2

Perl script:

#!/usr/bin/env perl

my $desc = do { local $/ = undef; <> };

$desc .= "\n===="; # set the end marker

while($desc =~ /^==== (?<filename>.*?)#.*?====$(?<content>.*?)(?=^====)/mgsp) {
  print "filename=", $+{filename}, "\n";
  print "content=", $+{content}, "\n";
}

This way the script finds both segments. How can I avoid adding the marker?

Answer 1

Use of the non-greedy modifier ? is a giant red flag. You can usually get away with using it once in a pattern, but more than that is usually a bug. If you want to match text that doesn't contain a string, use the following instead:

(?:(?!STRING).)*

So that gets you the following:

/
   ^==== [ ] (?<filename> [^\n]+ ) [ ] ====\n
   (?<content> (?:(?! ^==== ).)* )
/xsmg
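To see the idiom in isolation, here is a minimal sketch on a simpler input. The helper name up_to_marker and the << and >> delimiters are made up for illustration and are not part of the question. The second call shows that the match simply runs to the end of the string when the stop string never appears, which is why no end marker is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text after "<<" up to, but not
# including, the next ">>" (or up to the end of the string).
sub up_to_marker {
    my ($text) = @_;

    # (?:(?!>>).)* consumes one character at a time, but only while
    # the text at the current position does not start with ">>".
    return $text =~ /<<((?:(?!>>).)*)/s ? $1 : undef;
}

print up_to_marker("a <<first>> b <<second"), "\n";   # prints: first
print up_to_marker("no close: <<tail"), "\n";         # prints: tail
```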

Code:

my $desc = do { local $/; <DATA> };

while (
   $desc =~ /
      ^==== [ ] (?<filename> [^\n]+ ) [ ] ====\n
      (?<content> (?:(?! ^==== ).)* )
   /xsmg
) {
   print "filename=<<$+{filename}>>\n";
   print "content=<<$+{content}>>\n";
}

__DATA__
Header

==== /home/src/file1.c#1 ====
content file1
line 1 of file1
line 2 of file1
line 3 of file1

another line of file1

==== /home/src/file2.c#1 ====
content file2
line 1 of file2
line 2 of file2
line 3 of file2

another line of file2

Output:

filename=<</home/src/file1.c#1>>
content=<<content file1
line 1 of file1
line 2 of file1
line 3 of file1

another line of file1

>>
filename=<</home/src/file2.c#1>>
content=<<content file2
line 1 of file2
line 2 of file2
line 3 of file2

another line of file2
>>
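As an aside, a smaller change to the pattern from the question should also remove the need for the appended marker: let the lookahead accept the end of the string (\z) as an alternative to the next ==== line. The sketch below keeps the question's lazy quantifiers, which this answer cautions against, so treat it as a minimal patch rather than a recommendation. The sub name extract_sections and the shortened sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [filename, content] pairs.
sub extract_sections {
    my ($desc) = @_;
    my @sections;

    # Same pattern as in the question, except that the final lookahead
    # also accepts the end of the string (\z), so the last section
    # matches without an appended "====" marker.
    while ( $desc =~ /^==== (?<filename>.*?)#.*?====$(?<content>.*?)(?=^====|\z)/mgs ) {
        push @sections, [ $+{filename}, $+{content} ];
    }
    return @sections;
}

# Shortened sample input, made up for this sketch.
my $desc = "Header\n\n==== /home/src/file1.c#1 ====\ncontent file1\n\n"
         . "==== /home/src/file2.c#1 ====\ncontent file2\n";

for my $section ( extract_sections($desc) ) {
    print "filename=$section->[0]\n";
    print "content=$section->[1]";
}
```

As in the question's version, each captured content starts with the newline that follows the header line.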

Answer 2

You've made this more awkward by slurping the whole file in the first place. This is relatively simple if you read the file line by line:

use strict;
use warnings 'all';

my $file;

while ( <> ) {
    if ( /^====\s+(.*\S)#\S*\s+====/ ) {
        $file = $1;
        print "filename=$file\n";
        print 'content=';
    }
    elsif ( $file ) {
        print;
    }
}

Output:

filename=/home/src/file1.c
content=content file1
line 1 of file1
line 2 of file1
line 3 of file1

another line of file1

filename=/home/src/file2.c
content=content file2
line 1 of file2
line 2 of file2
line 3 of file2

another line of file2

Alternatively, if you need to store the whole content per file, perhaps as a hash, it would look like this:

use strict;
use warnings 'all';

my $file;
my %data;

while ( <> ) {
    if ( /^====\s+(.*\S)#\S*\s+====/ ) {
        $file = $1;
    }
    elsif ( $file ) {
        $data{$file} .= $_;
    }
}

for my $file ( sort keys %data ) {
    print "filename=$file\n";
    print "content=$data{$file}";
}

For this input, the output is identical to that of the first version above.
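One thing to keep in mind with the hash version is that sort keys prints the sections in alphabetical order, and two sections with the same filename are concatenated into one entry. If the original order matters, a sketch along the following lines might be closer to what you want. It collects the sections in an array instead. The sub name read_sections is an assumption for illustration, and an in-memory filehandle with shortened sample data stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read sections from an open filehandle and return
# them as a list of [filename, content] pairs in order of appearance.
sub read_sections {
    my ($fh) = @_;
    my @sections;

    while ( my $line = <$fh> ) {
        if ( $line =~ /^====\s+(.*\S)#\S*\s+====/ ) {
            # A header line starts a new section with empty content.
            push @sections, [ $1, '' ];
        }
        elsif (@sections) {
            # Any other line belongs to the most recent section.
            # Lines before the first header are skipped.
            $sections[-1][1] .= $line;
        }
    }
    return @sections;
}

# Usage: shortened sample data read through an in-memory filehandle.
my $text = "Header\n\n==== /home/src/file1.c#1 ====\ncontent file1\n\n"
         . "==== /home/src/file2.c#1 ====\ncontent file2\n";

open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
for my $section ( read_sections($fh) ) {
    my ( $name, $content ) = @$section;
    print "filename=$name\n";
    print "content=$content";
}
close $fh;
```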
