Perl regex for extracting multiline blocks

Question

I have text like this:

00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still
have

So, I don't have a block end, just a new block start.

I want to recursively get all blocks:

1 = 00:00 stuff
2 = 00:01 more stuff
multi line
 and going

etc

The code below only gives me this:

$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';

What am I doing wrong?

use Data::Dumper;

my $text = '00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still 
have
    ';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);

Answer 1

Version 5.10.0 introduced named capture groups that are useful for matching nontrivial patterns.

(?'NAME'pattern)
(?<NAME>pattern)

A named capture group. Identical in every respect to normal capturing parentheses () but for the additional fact that the group can be referred to by name in various regular expression constructs (such as \g{NAME}) and can be accessed by name after a successful match via %+ or %-. See perlvar for more details on the %+ and %- hashes.

If multiple distinct capture groups have the same name then the $+{NAME} will refer to the leftmost defined group in the match.

The forms (?'NAME'pattern) and (?<NAME>pattern) are equivalent.
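As a minimal sketch of the %+ lookup described above, the following matches a single line and reads the captures back by name. The helper name parse_line and the group names time and desc are assumptions made for illustration; they are not part of the quoted documentation.

```perl
use strict;
use warnings;
use 5.010;    # named captures need Perl 5.10 or later

# parse_line is a hypothetical helper used only for this illustration.
sub parse_line {
    my ($line) = @_;

    # (?<time>...) and (?<desc>...) are named capture groups
    return unless $line =~ /^(?<time>[0-9]{2}:[0-9]{2})[ \t]+(?<desc>.*)$/;

    # %+ exposes the named captures of the last successful match
    return ($+{time}, $+{desc});
}

my ($time, $desc) = parse_line('00:01 more stuff');
print "time=[$time] desc=[$desc]\n";
```

The numbered variables $1 and $2 would still work here; the names mostly pay off once a pattern has several groups.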

Named capture groups allow us to name subpatterns within the regex as in the following.

use 5.10.0;  # named capture buffers

my $block_pattern = qr/
  (?<time>(?&_time)) (?&_sp) (?<desc>(?&_desc))

  (?(DEFINE)
    # timestamp at logical beginning-of-line
    (?<_time> (?m:^) [0-9][0-9]:[0-9][0-9])

    # runs of spaces or tabs
    (?<_sp> [ \t]+)

    # description is everything through the end of the record
    (?<_desc>
      # s switch makes . match newline too
      (?s: .+?)

      # terminate before optional whitespace (which we remove) followed
      # by either end-of-string or the start of another block
      (?= (?&_sp)? (?: $ | (?&_time)))
    )
  )
/x;

Use it as follows:

my $text = '00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still
have
    ';

while ($text =~ /$block_pattern/g) {
  print "time=[$+{time}]\n",
        "desc=[[[\n",
        $+{desc},
        "]]]\n\n";
}

Output:

$ ./blocks-demo
time=[00:00]
desc=[[[
stuff
]]]

time=[00:01]
desc=[[[
more stuff
multi line
 and going
]]]

time=[00:02]
desc=[[[
still
have
]]]

Answer 2

This should do the trick. The beginning of the next \d\d:\d\d is treated as the end of the current block.

use strict;

my $Str = '00:00 stuff
00:01 more stuff
multi line
  and going
00:02 still 
    have
00:03 still 
    have' ;

my @Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);

print join "--\n", @Blocks;
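If the timestamp and the text are needed separately, each captured block can be taken apart in a second step. This is only a sketch built on the same pattern; the @records array and the time and body keys are names assumed for illustration.

```perl
use strict;
use warnings;

my $str = "00:00 stuff\n00:01 more stuff\nmulti line\n00:02 still\n    have";

my @records;
for my $block ($str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs) {
    # Separate the leading timestamp from the rest of the block.
    my ($time, $body) = $block =~ m/\A(\d\d:\d\d)\s*(.*)\z/s;

    # Drop the trailing newline and any other trailing whitespace.
    $body =~ s/\s+\z//;

    push @records, { time => $time, body => $body };
}

print "$_->{time} => $_->{body}\n--\n" for @records;
```

Note that the look-ahead is not anchored to the start of a line, so a timestamp appearing in the middle of a description would also start a new block.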

Answer 3

Your problem is that .*? is non-greedy in the same way that .* is greedy. When it is not forced, it matches as little as possible, which in this case is the empty string.

So you'll need something after the non-greedy match to anchor your capture. I came up with this regex:

my @array = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;

As you can see, I removed the /m option so that $ in the look-ahead assertion matches the end of the string rather than the end of every line.
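To see the difference side by side, here is a small sketch that runs the pattern from the question and the pattern with the look-ahead against the same text. The sample string is a simplified version of the one in the question, and the variable names @lazy and @blocks are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "00:00 stuff\n00:01 more stuff\nmulti line\n and going\n00:02 still\nhave\n";

# Pattern from the question: nothing follows .*?, so the lazy group
# is never forced to grow and always captures the empty string.
my @lazy = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;

# With a look-ahead after .*?, the lazy match has to extend until the
# next block start (or the end of the string) comes into view.
my @blocks = $text =~ m/\n?([0-9]{2}:[0-9]{2}.*?)(?=\n[0-9]{2}:|$)/gs;

print scalar(@lazy), " captures, the second one is '", $lazy[1], "'\n";
print "[", $_, "]\n" for @blocks;
```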

You might also consider this solution:

my @array = split /(?=[0-9]{2}:[0-9]{2})/, $text;
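One caveat with the split approach: the look-ahead fires wherever a timestamp appears, including in the middle of a line. A possible variant, sketched here with made-up sample text, anchors the split to line starts with ^ and /m. The names @loose and @anchored are illustrative only.

```perl
use strict;
use warnings;

# Sample text with a timestamp ("12:30") in the middle of a line.
my $text = "00:00 stuff\n00:01 see 12:30 note\nmulti line\n00:02 still\nhave\n";

# Unanchored look-ahead: also splits in front of the mid-line "12:30".
my @loose = split /(?=[0-9]{2}:[0-9]{2})/, $text;

# With /m, ^ matches at every line start, so only timestamps that
# begin a line start a new block.
my @anchored = split /^(?=[0-9]{2}:[0-9]{2})/m, $text;

print scalar(@loose), " pieces without the anchor, ",
      scalar(@anchored), " with it\n";
```

Either way, each piece keeps its trailing newline, so some trimming may be needed afterwards.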

3 answers

solution1
4 2012-05-14 13:26:31

solution2
3 ACCEPTED 2012-05-14 12:42:41

solution3
0 2012-05-14 12:42:09