
Capturing select data between certain lines in a file in Perl.

I have a file with contents of this sort:

*** X REGION ***
|-------------------------------------------------------------------------------------------------|
| X                                                                                               |                                                                                           
| addr              tag          extra data   |
|-------------------------------------------------------------------------------------------------|
| $A1    label_A1X                   |       1 |
| $A2    label_A2X                   |       2 |
| $A3    label_A3X                   |       3 |

*** Y REGION ***

|-------------------------------------------------------------------------------------------------|
| Y                                                                                            |
| addr              tag           extra data  |
|-------------------------------------------------------------------------------------------------|
| $0     label_0Y                    |        99 |
| $1                                 |        98 |

I need to capture the data under 'addr' and 'tag', separated by commas, and keep the records under 'X REGION' and 'Y REGION' apart. Here's what I tried:

open($fh1, "<", $memFile) or warn "Cannot open $memFile, $!";            #input file with contents as described above. 

open($fh, "+<", $XFile) or warn "Cannot open $XFile, $!";                
open($fh2, "+<", $YFile) or warn "Cannot open $YFile, $!";               

while(my $line = <$fh1>)
{

  chomp $line;
  $line = $line if (/\s+\*\*\*\s+X REGION\s+\*\*\*/ .. /\s+\*\*\*\s+Y REGION\s+\*\*\*/);        #Trying to get at the stuff in the X region.
  if($line =~ /\s+|\s+\$(.*)\s+(.*)\s+|(.*)/) 
  {
    $line = "$1,$2";
    print $fh $line; 
    print $fh "\n";
  }

  my $lastLineNum = `tail -1 filename`;
  $line = $line if (/\*\*\* Y REGION \*\*\*/ .. $lastLineNum);                      #Trying to get at the stuff in the Y region.
  if($line =~ /\s+|\s+\$(.*)\s+(.*)\s+|(.*)/)
  {
    $line = "$1,$2";
    print $fh2 $line;
    print $fh2 "\n";
  }

}

Perl warns that $1 and $2 are uninitialized. Is the regex incorrect? If not (or in addition), what else is wrong?

This snippet does what you need (it relies on Perl's default variable $_, which is what a bare /.../ match tests):

# use die instead of warn, don't go ahead if there is no file
open(my $fin, "<", $memFile) or die "Cannot open $memFile, $!"; 

while(<$fin>)
{
    # Flip flop between X and Y regions
    if (/[*]{3}\h+X REGION\h+[*]{3}/../[*]{3}\h+Y REGION\h+[*]{3}/) {
        print "X: $1,$2\n" if (/.*\$(\S*)\h*(\S*)\h*[|]/)
    }

    # Flip-flop from Y to the end of input: undef never becomes true, so no external tail is needed
    if (/[*]{3}\h+Y REGION\h+[*]{3}/..undef) {
        print "Y: $1,$2\n" if (/.*\$(\S*)\h*(\S*)\h*[|]/)
    }
}

This is the output:

X: A1,label_A1X
X: A2,label_A2X
X: A3,label_A3X
Y: 0,label_0Y
Y: 1,
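
The question asks for the X and Y records to be kept apart rather than printed with a prefix. As a rough sketch of one way to do that, the block below collects the "addr,tag" strings per region. It tracks the current region in a plain variable instead of using the flip-flop operator, which is a different technique from the answer above. The helper name split_regions and the in-memory sample are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: read table lines from a filehandle and return
# { X => [...], Y => [...] }, each entry being an "addr,tag" string.
sub split_regions {
    my ($fh) = @_;
    my %rows = (X => [], Y => []);
    my $region;                          # undef until a region header is seen
    while (my $line = <$fh>) {
        if ($line =~ /[*]{3}\h+([XY]) REGION\h+[*]{3}/) {
            $region = $1;                # switch region on each header line
            next;
        }
        next unless defined $region;
        # Same extraction regex as in the answer above.
        push @{ $rows{$region} }, "$1,$2"
            if $line =~ /.*\$(\S*)\h*(\S*)\h*[|]/;
    }
    return \%rows;
}

# Demo on an in-memory excerpt of the sample (a real script would open $memFile).
my $sample = <<'END';
*** X REGION ***
| addr              tag          extra data   |
| $A1    label_A1X                   |       1 |
*** Y REGION ***
| $0     label_0Y                    |        99 |
| $1                                 |        98 |
END

open(my $in, '<', \$sample) or die "Cannot open in-memory sample: $!";
my $rows = split_regions($in);
close $in;

print "X: $_\n" for @{ $rows->{X} };
print "Y: $_\n" for @{ $rows->{Y} };
```

Writing each list to its own file then only needs one print loop per output handle, for example handles opened on $XFile and $YFile.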


Regarding your code, there are several points to fix:

  • In the regex that selects the elements between the delimiters, the pipe | needs escaping, either with a backslash \| or with the character class [|] (I prefer the latter). Unescaped, | means alternation, so your pattern is really three alternatives. The first one, \s+, succeeds on any line that contains whitespace without setting any capture group, which is why $1 and $2 are uninitialized.

  • \s also matches newline \n and carriage return \r, so don't use it as a general stand-in for space and tab \t. Use \h (horizontal whitespace only) instead.

  • You start the regex with \s+, but in the example the first character of every table line is '|'.

  • . matches any character except a newline \n, so .* runs greedily to the end of the line, spaces included.

  • So a regex like .*\s+ applied to a multi-line string matches the entire line plus the newline (via \s) and any leading whitespace on the next line too.

  • The Perl flip-flop operator .. gives you the lines in the selected region (edges included), but still one line at a time, so even the escaped-pipe form of your regex:

    \s+[|]\s+\$(.*)\s+(.*)\s+[|](.*)

    cannot match: it needs whitespace before the pipe that precedes the '$', and within a single line that pipe is the first character.
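
To make the pipe issue concrete, here is a small sketch that runs the question's pattern and a pipe-escaped variant against one sample row. The helper capture_pair and the variant pattern are illustrative assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a pattern and return ($1, $2) on a match,
# or an empty list when the pattern does not match.
sub capture_pair {
    my ($re, $line) = @_;
    return $line =~ $re ? ($1, $2) : ();
}

my $line = '| $A1    label_A1X                   |       1 |';

# Pattern from the question: the bare pipes create three alternatives,
# and the first one (\s+) matches without setting any capture group.
my $unescaped = qr/\s+|\s+\$(.*)\s+(.*)\s+|(.*)/;

# Illustrative variant: literal pipes, no leading \s+, non-space captures.
my $escaped = qr/[|]\h*\$(\S*)\h*(\S*)\h*[|]/;

my @bad  = capture_pair($unescaped, $line);
my @good = capture_pair($escaped,   $line);

print defined($bad[0]) ? "unescaped: $bad[0]\n" : "unescaped: matched, but \$1 is undef\n";
print "escaped: $good[0],$good[1]\n";
```

With the unescaped pattern the match succeeds but both groups are undef. The variant yields A1 and label_A1X.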

So I replaced the data-extracting regex with this one:

.*\$(\S*)\h*(\S*)\h*[|]

Regex breakdown

.*\$     # matches all till a literal dollar '$'
(\S*)    # Capturing group $1: zero or more non-whitespace characters, i.e. [^\s]
         # can be replaced with (\w*) if your labels match [0-9a-zA-Z_]
\h*      # Match zero or more horizontal spaces 
(\S*)    # Capturing group $2, as above
\h*      # Match zero or more horizontal spaces 
[|]      # Match a literal pipe '|'
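
The same breakdown can be kept inside the pattern itself with the /x modifier, which lets a regex carry whitespace and comments. The sketch below assumes a helper named parse_row, which is not in the original answer. The comments avoid the dollar sign and the slash, because variables are still interpolated and the delimiter still ends the pattern inside /x comments.

```perl
use strict;
use warnings;

# The extraction regex from above, written with /x so it documents itself.
my $row_re = qr/
    .* \$      # everything up to a literal dollar sign
    (\S*)      # group 1: the address, zero or more non-whitespace characters
    \h*        # optional horizontal whitespace
    (\S*)      # group 2: the tag, which may be empty
    \h*        # optional horizontal whitespace
    [|]        # a literal pipe
/x;

# Hypothetical helper: return "addr,tag" for a data row, or undef otherwise.
sub parse_row {
    my ($line) = @_;
    return $line =~ $row_re ? "$1,$2" : undef;
}

print parse_row('| $A1    label_A1X                   |       1 |'), "\n";
print parse_row('| $1                                 |        98 |'), "\n";
```

The two calls print A1,label_A1X and 1, respectively. A header row without a dollar sign makes parse_row return undef.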
