
Perl: extract text between the SAME delimiter using flip-flop

I have used the flip-flop operator to extract text in the past, when the START and END delimiters were different. This time I'm having a lot of trouble, because my source file does not have different delimiters: the START and END of the flip-flop are the same. I want the flip-flop to become true when a line begins with a year (yyyy) and to keep pushing $_ to an array until another line begins with yyyy. The problem is that the flip-flop then turns false on my next START.

while (<SOURCEFILE>) {
  print if (/^2017/ ... /^2017/) 
}

Using the above on the source data below misses the second multi-line part of the file, which I also need to match. Maybe flip-flop, which I thought was the best way to parse a multi-line file, will not work in this case? What I want is to start matching at the first line that starts with a date and to continue matching until the line before the next line that begins with a date.

Sample Data is:

2017 message 1
Text
Text

Text

2017 message 2
more text
more text

more text

2017 message 3
yet more text
yet more text

yet more text

But I am getting:

2017 message 1
Text
Text

Text

2017 message 2
2017 message 3
yet more text
yet more text

yet more text

(...the contents of message 2 are missing...)

I cannot rely on whitespace or on a different END delimiter in my source data. What I wanted was for each message to be printed (actually, to push @myarray, $_ and then test for matches), but I am missing the lines below message 2 because the flip-flop has been set to false there. Is there any way to handle this with flip-flop, or do I need to use something else? Thanks in advance to anyone who can help or advise.

Here is a way to go:

use Modern::Perl;
use Data::Dumper;
my $part = -1;
my $parts;
while(<DATA>) {
    chomp;
    if (/^2017/ .. 1==0) {
        $part++ if /^2017/;
        push @{$parts->[$part]}, $_;
    }
}
say Dumper$parts;

__DATA__
2017 message 1
Text
Text

Text

2017 message 2
more text
more text

more text

2017 message 3
yet more text
yet more text

yet more text

Output:

$VAR1 = [
          [
            '2017 message 1',
            'Text',
            'Text',
            '',
            'Text',
            ''
          ],
          [
            '2017 message 2',
            'more text',
            'more text',
            '',
            'more text',
            ''
          ],
          [
            '2017 message 3',
            'yet more text',
            'yet more text',
            '',
            'yet more text'
          ]
        ];

I don't know how to do it with flip-flop; I tried about a year ago without success. But I got the same result with some simple flag logic:

my $line_concat = "";    # initialise so the ne "" tests never compare undef
my $f = 0;
while (<DATA>) {
    if(/^2017/ && !$f) {
        $f = 1;
    }

    if (/^2017/) {
        print "$line_concat\n" if $line_concat ne "";
        $line_concat = "";
    }

    $line_concat .= $_ if $f;
}

print $line_concat if $line_concat ne "";

Flip flop with a matched delimiter doesn't work too well, as you've found.
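
To see why, here is a minimal sketch that runs the question's condition over a shortened, hypothetical copy of the sample data (the @lines and @kept names are mine, not from the question). The line that ends one range is the same line that should start the next one, but the operator only switches off on that line, so the body of every second message is dropped:

```perl
use strict;
use warnings;

# Hypothetical sample input, shortened from the question's data.
my @lines = (
    "2017 message 1", "Text",
    "2017 message 2", "more text",
    "2017 message 3", "yet more text",
);

my @kept;
for (@lines) {
    # Same condition as in the question: both ends match /^2017/.
    # The header of message 2 closes the range opened by message 1,
    # so "more text" is tested while the flip-flop is off.
    push @kept, $_ if /^2017/ ... /^2017/;
}

print "$_\n" for @kept;
```

With this input, "more text" is the only line that does not end up in @kept, which mirrors the output shown in the question.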

Have you considered setting $/ instead?

Eg:

#!/usr/bin/env perl
use strict;
use warnings; 

local $/ = "2017 message";
my $count;

while ( <DATA> ) {

    print "\nStart of block:", ++$count, "\n";

    print;

    print "\nEnd of block:", $count, "\n";
}

__DATA__
2017 message 1
Text
Text

Text

2017 message 2
more text
more text

more text

2017 message 3
yet more text
yet more text

yet more text

It's not perfect, though, because it splits the file on the delimiter, which means there is a 'bit' before the first delimiter (so you get 4 chunks). You can fix that up with judicious use of 'chomp', which removes $/ from the end of the current chunk:

#!/usr/bin/env perl
use strict;
use warnings; 

local $/ = "2017 message";
my $count;

while ( <DATA> ) {
    #remove '2017 message'
    chomp;
    #check for empty (first) block
    next unless /\S/;
    print "\nStart of block:", ++$count, "\n";
    #re add '2017 message'
    print $/;
    print;

    print "\nEnd of block:", $count, "\n";
}
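
Since the question mentions pushing to an array rather than printing, the same idea can be sketched as below. This is only an illustration under assumptions: the sample text, the in-memory filehandle and the @records name are hypothetical, and each record is kept as a single string.

```perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle.
my $data = "2017 message 1\nText\n\n2017 message 2\nmore text\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = "2017 message";          # record separator for this block only
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                   # strips a trailing "2017 message", if present
        next unless $chunk =~ /\S/;     # skip the empty chunk before the first header
        push @records, $/ . $chunk;     # put the header text back at the front
    }
}
close $fh;

print scalar(@records), " records\n";
```

One thing to keep in mind with this approach: $/ is a plain string, so it also matches "2017 message" in the middle of a line, not only at the start of one.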

Alternatively, how about a hash of arrays, where you update the 'target key' each time you hit a message header?

#!/usr/bin/env perl
use strict;
use warnings; 

use Data::Dumper;

my %messages; 
my $message_id;
while ( <DATA> ) {
   chomp;
   if ( m/2017 message (\d+)/ ) { $message_id = $1 }; 
   push @{ $messages{$message_id} }, $_; 
}

print Dumper \%messages;

Note - I'm using a hash, not an array, because that's a bit more robust when the message numbering doesn't run consecutively from zero. (An array used this way would have an empty 'zeroth' element.)

Note - it will also contain empty '' elements for your blank lines. You can filter these out if you wish, though.
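
A minimal sketch of that filtering step, assuming a hash shaped like the %messages built above (the literal data here is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical data in the same shape as the %messages hash above.
my %messages = (
    1 => [ '2017 message 1', 'Text', 'Text', '', 'Text', '' ],
    2 => [ '2017 message 2', 'more text', '', 'more text' ],
);

for my $id ( keys %messages ) {
    # Keep only lines that contain at least one non-whitespace character.
    @{ $messages{$id} } = grep { /\S/ } @{ $messages{$id} };
}

print "$_: @{ $messages{$_} }\n" for sort { $a <=> $b } keys %messages;
```

You could equally apply the grep while reading (skip the push for blank lines), if the blank lines are never needed.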

You just need a buffer that accumulates the lines until you find one matching /^20\d\d[ ]/ or end of file.

my $in = 0;
my @buf;
while (<>) {
   if ($in && /^20\d\d[ ]/) {
      process(@buf);
      @buf = ();
      $in = 0;
   }

   push @buf, $_ if $in ||= /^2017[ ]/;
}

process(@buf) if $in;

We can rearrange the code so that the records are processed in only one place, which allows process to be inlined.

my $in = 0;
my @buf;
while (1) {
   $_ = <>;

   if ($in && (!defined($_) || /^20\d\d[ ]/)) {
      process(@buf);
      @buf = ();
      $in = 0;
   }

   last if !defined($_);

   push @buf, $_ if $in ||= /^2017[ ]/;
}
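
As an illustration of what inlining might look like, here is a sketch in which the call to process is replaced by a push onto a results array. The sample input, the in-memory filehandle, the $line variable and the @records name are all assumptions made for the example, not part of the code above.

```perl
use strict;
use warnings;

my $data = <<'END';   # hypothetical sample input
preamble line
2017 message 1
Text
2017 message 2
more text
END
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my $in = 0;
my @buf;
my @records;
while (1) {
    my $line = <$fh>;

    # A new header or end of file closes the record currently in the buffer.
    if ($in && (!defined($line) || $line =~ /^20\d\d[ ]/)) {
        push @records, join '', @buf;   # inlined stand-in for process(@buf)
        @buf = ();
        $in = 0;
    }

    last if !defined($line);

    # Start buffering at a 2017 header and keep buffering once started.
    push @buf, $line if $in ||= $line =~ /^2017[ ]/;
}
close $fh;

print scalar(@records), " records\n";
```

Lines before the first header (the "preamble line" here) are never buffered, because $in only becomes true at the first line matching /^2017[ ]/.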

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address.

 