简体   繁体   中英

Perl: extracting data from text using regex

I am using Perl to do text processing with regex. I have no control over the input. I have shown some examples of the input below.

As you can see the items B and C can be in the string n times with different values. I need to get all the values as back reference. Or if you know of a different way i am all ears.

I am trying to use branch reset pattern (as outlined at perldoc: "Extended Patterns" ) I am not having much luck matching the string.

("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

My Perl is below, any help would be great. Thanks for any help you can give.

if($inputString =~/\("Data" \(Int "A" ([0-9]+)\)(?:\(Int "B" ([0-9]+)\)\(Int "C" ([0-9]+)\))+\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {

    print "\n\nmatched\n";

    print "1: $1\n";
    print "2: $2\n";
    print "3: $3\n";
    print "4: $4\n";
    print "5: $5\n";
    print "6: $6\n";
    print "7: $7\n";
    print "8: $8\n";
    print "9: $9\n";

}

Don't try to use one regex a set of regexes and splits are easier to understand:

#!/usr/bin/perl

use strict;
use warnings;

while (<DATA>) {
    next unless my ($data) = /\("Data" (.*)\)/;
    print "on line $., I saw:\n";
    for my $item ($data =~ /\((.*?)\)/g) {
        my ($type, $var, $num) = split " ", $item;
        print "\ttype $type var $var num $num\n";
    }
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

If your data can stretch across lines, I would suggest using a parser instead of a regex.

I am not sure what benefit there would be in getting the values as back references - who would you wish to deal with the case of duplicated keys (like "C" in the second line). Also I am not sure what you wish to do with the values once extracts.

But I would start with something like:

use Data::Dumper;

while (<DATA>)
{
    my @a = m!\(Int "(.*?)" ([0-9]+)\)!g;
    print Dumper(\@a);
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C"     6)(Int "D" 34896)(Int "E" 38046)) 
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

This gives you an array of repeated key,value(s).

My initial thought was to use named captures and to get the values from %- :

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )+
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

Unfortunately, the (?:...) grouping doesn't trigger capturing multiple values for B and C. I suspect that this is a bug. Doing it explicitly does capture all the values but you would have to know the maximum number of instances ahead of time.

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    \(Int\s+"B"\s+(?<B>[0-9]+)\)
    \(Int\s+"C"\s+(?<C>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    # repeat (?:...) N times
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

The simplest approach is to use m//g . You can either capture name/value pairs as Beano suggests or use multiple patterns to capture each value:

my @b = m/Int "B" ([0-9]+)/g;
my @c = m/Int "C" ([0-9]+)/g;
# etc.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM