
Perl: extracting data from text using regex

I am using Perl to do text processing with regex. I have no control over the input. Some examples of the input are shown below.

As you can see, the items B and C can appear in the string n times with different values. I need to get all the values as backreferences. If you know of a different way, I am all ears.

I am trying to use the branch reset pattern (as outlined in perldoc: "Extended Patterns"), but I am not having much luck matching the string.

("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

My Perl is below. Thanks for any help you can give.

if($inputString =~/\("Data" \(Int "A" ([0-9]+)\)(?:\(Int "B" ([0-9]+)\)\(Int "C" ([0-9]+)\))+\(Int "D" ([0-9]+)\)\(Int "E" ([0-9]+)\)\)/) {

    print "\n\nmatched\n";

    print "1: $1\n";
    print "2: $2\n";
    print "3: $3\n";
    print "4: $4\n";
    print "5: $5\n";
    print "6: $6\n";
    print "7: $7\n";
    print "8: $8\n";
    print "9: $9\n";

}

Don't try to use one regex; a set of regexes and splits is easier to understand:

#!/usr/bin/perl

use strict;
use warnings;

while (<DATA>) {
    next unless my ($data) = /\("Data" (.*)\)/;
    print "on line $., I saw:\n";
    for my $item ($data =~ /\((.*?)\)/g) {
        my ($type, $var, $num) = split " ", $item;
        print "\ttype $type var $var num $num\n";
    }
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

If your data can stretch across lines, I would suggest using a parser instead of a regex.
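If you need the values rather than a printout, the same regex-plus-split approach can collect them into records. The following is only a sketch: `parse_line` is a hypothetical helper name, and it assumes every item has exactly three whitespace-separated fields, as in the sample input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one input line into a list of [type, name, number] records.
# Returns an empty list when the line is not a ("Data" ...) record.
# parse_line is a hypothetical helper, not part of the original answer.
sub parse_line {
    my ($line) = @_;
    my ($data) = $line =~ /\("Data" (.*)\)/ or return;
    my @items;
    for my $item ($data =~ /\((.*?)\)/g) {
        my ($type, $var, $num) = split " ", $item;
        $var =~ tr/"//d;    # drop the double quotes around the name
        push @items, [$type, $var, $num];
    }
    return @items;
}

my @items = parse_line('("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))');
print join(" ", @$_), "\n" for @items;
```

Returning array references keeps repeated B and C items in their original order, which a plain hash keyed by name would not.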

I am not sure what benefit there would be in getting the values as backreferences. How would you wish to deal with duplicated keys (like "C" in the second line)? I am also not sure what you wish to do with the values once they are extracted.

But I would start with something like:

use Data::Dumper;

while (<DATA>)
{
    my @a = m!\(Int "(.*?)" ([0-9]+)\)!g;
    print Dumper(\@a);
}

__DATA__
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))
("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))

This gives you a flat array of alternating keys and values, in which a repeated key appears once per occurrence.
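One way to handle the duplicated keys is to group that flat list into a hash of arrays. This is a minimal sketch, not part of the original answer; `group_pairs` is a hypothetical helper name, and the input is the second sample line from the question.

```perl
use strict;
use warnings;

# Group a flat (key, value, key, value, ...) list by key, keeping every
# value in order of appearance so duplicated keys are not overwritten.
# group_pairs is a hypothetical helper introduced for illustration.
sub group_pairs {
    my @flat = @_;
    my %values_for;
    while (my ($key, $value) = splice @flat, 0, 2) {
        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

my $line  = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))';
my $group = group_pairs($line =~ m!\(Int "(.*?)" ([0-9]+)\)!g);

print "$_: @{ $group->{$_} }\n" for sort keys %$group;
```

For this line, `$group->{B}` holds 1, 3, 5 and `$group->{C}` holds 2, 4, 6, while A, D and E each hold a single value.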

My initial thought was to use named captures and to get the values from %-:

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )+
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;

Unfortunately, the (?:...)+ grouping doesn't capture multiple values for B and C: a capture group inside a repeated group keeps only the value from its last iteration. Writing the groups out explicitly does capture all the values, but you would have to know the maximum number of instances ahead of time.

my $pattern = qr/
  \(
    "Data"\s+
    \(Int\s+"A"\s+(?<A>[0-9]+)\)
    \(Int\s+"B"\s+(?<B>[0-9]+)\)
    \(Int\s+"C"\s+(?<C>[0-9]+)\)
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    (?:
      \(Int\s+"B"\s+(?<B>[0-9]+)\)
      \(Int\s+"C"\s+(?<C>[0-9]+)\)
    )?
    # repeat (?:...) N times
    \(Int\s+"D"\s+(?<D>[0-9]+)\)
    \(Int\s+"E"\s+(?<E>[0-9]+)\)
  \)
/x;
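The difference between the two patterns can be checked with a reduced test. This sketch matches only the B/C part of the second sample line and assumes Perl 5.10 or later (for named captures and %-); the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = '("Data" (Int "A" 22)(Int "B" 1)(Int "C" 2)(Int "B" 3)(Int "C" 4)(Int "B" 5)(Int "C" 6)(Int "D" 34896)(Int "E" 38046))';

# One named group inside a repeated (?:...)+ : each iteration overwrites
# the previous capture, so only the last B value survives.
my @repeated;
if ($line =~ /(?:\(Int "B" (?<B>[0-9]+)\)\(Int "C" (?<C>[0-9]+)\))+/) {
    @repeated = @{ $-{B} };
}

# Three same-named groups written out: %- keeps one slot per group,
# with undef in the slots of optional groups that did not take part.
my $pair = qr/\(Int "B" (?<B>[0-9]+)\)\(Int "C" (?<C>[0-9]+)\)/;
my @explicit;
if ($line =~ /$pair(?:$pair)?(?:$pair)?/) {
    @explicit = grep { defined } @{ $-{B} };
}

print "repeated: @repeated\n";
print "explicit: @explicit\n";
```

The repeated form leaves a single value for B (the last one, 5), while the written-out form yields 1, 3 and 5.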

The simplest approach is to use m//g. You can either capture name/value pairs as Beano suggests or use multiple patterns to capture each value:

my @b = m/Int "B" ([0-9]+)/g;
my @c = m/Int "C" ([0-9]+)/g;
# etc.
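If the B and C values need to stay paired, a single m//g pattern with two captures can do that. This is a sketch that assumes each B item is immediately followed by its C item, as in the sample input; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = '("Data" (Int "A" 22)(Int "B" 22)(Int "C" 59)(Int "B" 1143)(Int "C" 1210)(Int "B" 1232)(Int "C" 34896)(Int "D" 34896)(Int "E" 38046))';

# In list context m//g returns every capture of every match, flattened:
# (B1, C1, B2, C2, ...).
my @flat = $line =~ /\(Int "B" ([0-9]+)\)\(Int "C" ([0-9]+)\)/g;

# Regroup the flat list into [B, C] pairs.
my @pairs;
while (my ($b_val, $c_val) = splice @flat, 0, 2) {
    push @pairs, [$b_val, $c_val];
}

printf "B=%s C=%s\n", @$_ for @pairs;
```

Separate @b and @c arrays are simpler when the values are used independently; the paired form avoids having to match them up by index afterwards.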
