
How to extract multiple columns from a CSV file using Perl

I'm pretty new to Perl and was hoping someone could help me with this issue. I need to extract two columns from a CSV file that contains embedded commas. This is what the format looks like:

"ID","URL","DATE","XXID","DATE-LONGFORMAT"

I need to extract the DATE column, the XXID column, and the column immediately after XXID. Note that not every line has the same number of columns.

The XXID column contains a two-letter prefix that doesn't always start with the same letter; it can be pretty much any letter of the alphabet. The length is always the same.

Finally, once these three columns are extracted, I need to sort on the XXID column and get a count of duplicates.

Here's a sample script using the Text::CSV module to parse your CSV data. Consult the module's documentation to find the proper settings for your data.

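#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 permits binary characters (such as embedded newlines) in quoted fields
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

# Read one parsed row at a time; column indexes start at 0.
# The input below __DATA__ is the format line from the question.
while (my $row = $csv->getline(*DATA)) {
    print "Date: $row->[2]\n";     # DATE
    print "Col#1: $row->[3]\n";    # XXID
    print "Col#2: $row->[4]\n";    # column immediately after XXID
}

__DATA__
"ID","URL","DATE","XXID","DATE-LONGFORMAT"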
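Building on the script above, the sorting and duplicate counting from the question could be sketched roughly as follows. This is only a sketch under stated assumptions: it assumes the first line is a header that names the DATE and XXID columns, it looks the columns up by name instead of by fixed index, and it skips rows that are too short. Every row after the header is made-up sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical sample data; only the header line comes from the question.
my $data = <<'END';
"ID","URL","DATE","XXID","DATE-LONGFORMAT"
"1","http://example.com/a,b","2011-01-02","AB12345","January 2, 2011"
"2","http://example.com/c","2011-01-03","ZZ99999","January 3, 2011","extra"
"3","http://example.com/d","2011-01-04","AB12345","January 4, 2011"
"4","short row"
END

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Map header names to column indexes (assumes a header row is present).
my $header = $csv->getline($fh) or die "No header row found";
my %index;
@index{ @{$header} } = 0 .. $#{$header};
for my $name (qw(DATE XXID)) {
    die "No $name column in header" unless exists $index{$name};
}
my ($date_col, $xxid_col) = @index{qw(DATE XXID)};

my (@rows, %count);
while (my $row = $csv->getline($fh)) {
    # Skip rows too short to hold DATE, XXID and the column after XXID.
    next if @{$row} <= $xxid_col + 1 or @{$row} <= $date_col;
    my ($date, $xxid, $next) = @{$row}[ $date_col, $xxid_col, $xxid_col + 1 ];
    push @rows, [ $date, $xxid, $next ];
    $count{$xxid}++;
}
close $fh;

# Sort the extracted rows on the XXID value.
my @sorted = sort { $a->[1] cmp $b->[1] } @rows;
print join("\t", @{$_}), "\n" for @sorted;

# Report only the XXID values that occur more than once.
for my $xxid (sort keys %count) {
    print "$xxid occurs $count{$xxid} times\n" if $count{$xxid} > 1;
}
```

Looking columns up by header name keeps the script working if extra columns are added before DATE, which fixed indexes such as `$row->[2]` would not tolerate.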

I published a module called Tie::Array::CSV, which lets Perl interact with your CSV file as a native nested Perl array. If you use it, you can apply your search logic just as if your data were already in an array of array references. Take a look!

#!/usr/bin/env perl

use strict;
use warnings;

use File::Temp;
use Tie::Array::CSV;
use List::MoreUtils qw/first_index/;
use Data::Dumper;

# this builds a temporary file from the script's __DATA__ section (sample rows not shown here)
# normally you would just set $file to the filename
my $file = File::Temp->new;
print {$file} <DATA>;
#########

tie my @csv, 'Tie::Array::CSV', $file;

#find column from data in first row
my $colnum = first_index { /^\w.{6}$/ } @{$csv[0]};
print "Using column: $colnum\n";

#extract that column
my @column = map { $csv[$_][$colnum] } (0..$#csv);

#build a hash of repetitions
my %reps;
$reps{$_}++ for @column;

print Dumper \%reps;
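Data::Dumper prints hash keys in no particular order. To get the sorted duplicate counts the question asks for, one option is to sort the keys of a hash like `%reps` and keep only the values seen more than once. A minimal sketch, using hypothetical XXID values in place of the real column and a hypothetical helper named `duplicate_report`:

```perl
use strict;
use warnings;

# Hypothetical column values standing in for @column in the script above.
my @column = qw(AB12345 ZZ99999 AB12345 CD00001 ZZ99999 AB12345);

# Count occurrences, in the same way the %reps hash is built above.
my %reps;
$reps{$_}++ for @column;

# Return [value, count] pairs for values seen more than once, sorted by value.
sub duplicate_report {
    my ($counts) = @_;
    return map  { [ $_, $counts->{$_} ] }
           grep { $counts->{$_} > 1 }
           sort keys %{$counts};
}

printf "%s appears %d times\n", @{$_} for duplicate_report(\%reps);
```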

You definitely want to use a CPAN library for parsing CSV, as you will never account for all the quirks of the format yourself.

Please see: How can I parse quoted CSV in Perl with a regex?

Please see: How do I efficiently parse a CSV file in Perl?
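To illustrate one such quirk, the sketch below compares a plain `split` on commas with Text::CSV's `parse` and `fields` methods on a line whose quoted field contains a comma. The sample line is hypothetical and only meant to show the difference in field counts.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical line: the second field contains a comma inside the quotes.
my $line = '"1","http://example.com/a,b","2011-01-02"';

# Naive approach: split on every comma, including the one inside the quotes.
my @naive = split /,/, $line;

# Text::CSV honours the quoting rules and keeps the quoted field intact.
my $csv = Text::CSV->new({ binary => 1 });
$csv->parse($line) or die "Parse failed: " . $csv->error_diag;
my @fields = $csv->fields;

printf "split found %d fields, Text::CSV found %d\n",
    scalar @naive, scalar @fields;
```

The naive split also leaves the surrounding quote characters in each piece, while `fields` returns the unquoted values.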

However, here is a very naive and non-idiomatic solution for the particular string you provided:

use strict;
use warnings;

my $string = '"ID","URL","DATE","XXID","DATE-LONGFORMAT"';

my @words = ();
my $word = "";
my $quotec = '"';
my $quoted = 0;

foreach my $c (split //, $string)
{
  if ($quoted)
  {
    if ($c eq $quotec)
    {
      $quoted = 0;
      push @words, $word;
      $word = "";
    }
    else
    {
      $word .= $c;
    }
  }
  elsif ($c eq $quotec)
  {
    $quoted = 1;
  }
}

for (my $i = 0; $i < scalar @words; ++$i)
{
  print "column " . ($i + 1) . " = $words[$i]\n";
}
