Perl regex store matches in array

I have a file with strings in each row, as follows:

"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1

The next line could look like:

84545,X,2

I'm trying to parse this text in Perl. Note: the quotes are present when a row holds several items, but absent when there is only one item. I would like to parse each item into an array. I tried the following regex:

@fields = ($_ =~  /(\d+\_\d+),*/g);

but it is missing the last item, 2714. How do I capture that edge case? Any help appreciated. Thanks in advance.

It looks like you have a CSV file, so use an actual CSV parser for it, such as Text::CSV.

After you parse the columns, you can split the first field into an array:

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new ( { binary => 1 } )  # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();

my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1};

if ($csv->parse($line)) {
    my @columns = $csv->fields();
    my @nums = split ',', $columns[0];

    print "@nums\n";
}

Outputs:

229269_2 190594_2 94552_2 266076_2 269628_2 165328_2 99319_2 263339_2 263300_2 99315_2 271509_2 2714
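The same two-step approach also covers the unquoted single-item rows mentioned in the question. The following is a minimal sketch, assuming the rows are already in memory; the helper name `first_column_items` is hypothetical, the sample row is shortened, and only the `parse` and `fields` methods shown above are used.

```perl
use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

# Hypothetical helper: returns the items held in the first CSV column.
sub first_column_items {
    my ($line) = @_;
    $csv->parse($line) or die "Cannot parse line: $line";
    my @columns = $csv->fields();
    return split /,/, $columns[0];
}

my @rows = (
    qq{"229269_2,190594_2,2714",A,1},   # quoted, several items (shortened sample)
    qq{84545,X,2},                      # unquoted, single item
);

for my $row (@rows) {
    my @nums = first_column_items($row);
    print scalar(@nums), " item(s): @nums\n";
}
```

The parser removes the quotes when they are present, so both row shapes go through the same code path.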

Why not a regex?

Yes, of course it's possible to use a regex for practically anything. But what you need to understand is that doing so here will make your code extremely fragile and difficult to maintain.
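As a concrete illustration of that fragility, here is a minimal sketch (not part of the original answer) that patches the question's regex by making the `_\d+` suffix optional. It does pick up `2714`, but because it scans the whole line it also picks up the trailing numeric column. The sample row is shortened.

```perl
use strict;
use warnings;

my $line = qq{"229269_2,190594_2,271509_2,2714",A,1};   # shortened sample row

# Patched version of the question's regex: the "_digits" suffix is optional.
my @fields = $line =~ /(\d+(?:_\d+)?)/g;

# The last element is the "1" from the third column, not part of the list.
print "@fields\n";
```

This prints `229269_2 190594_2 271509_2 2714 1`, so the count column leaks into the result unless the first column is isolated beforehand.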

Even if you want to use a regular expression, you should STILL do this in two steps. First separate out the initial column(s) of your CSV, and then process the specific column that you're interested in.

Because you're just working with the first column, you could use code like the following:

use strict;
use warnings;

my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1};

if ($line =~ /^"(.*?)"|^([^,]*)/) {
    my $column0 = $1 // $2;
    my @nums = split ',', $column0;

    print "@nums\n";
}

The above happens to accomplish the same thing as the previous code. However, it has one big flaw: it's not nearly as obvious to the maintaining programmer what's going on.

Whenever a new coder, or even yourself in 6 months, views the first set of code, it is extremely obvious what format your data is in: you're working with a CSV file, and the first column is a comma-separated list. The second code also works, but the new maintainer must actually read the regex and figure out what's going on to understand both what format the data is in and whether the code is handling it correctly.

Anyway, do whatever you will, but I strongly advise you to use an actual CSV parser for parsing CSV files.

If all you want is all but the last two fields...

   my $string = qq("229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1);
   $string =~ s/"//g;            # delete the quotes
   my @f = split (/,/, $string); # split on the comma
   pop @f; pop @f;               # jettison the last two columns

   # @f contains what you're looking for
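Wrapped in a subroutine, this approach can be sketched as follows. The name `leading_items` is hypothetical, the sample row is shortened, and the sketch assumes every row ends in exactly two extra columns and that no item contains a comma or quote of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: strip quotes, split on commas, drop the last two columns.
sub leading_items {
    my ($string) = @_;
    $string =~ s/"//g;              # delete the quotes
    my @f = split /,/, $string;     # split on the comma
    pop @f; pop @f;                 # jettison the last two columns
    return @f;
}

print join(' ', leading_items(qq("229269_2,190594_2,2714",A,1))), "\n";
print join(' ', leading_items('84545,X,2')), "\n";
```

Because the quotes are simply deleted, the quoted multi-item row and the unquoted single-item row are handled the same way, but a quoted comma in any other column would break the column count.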
