
How can I parse quoted CSV in Perl with a regex?

I'm having some issues with parsing CSV data with quotes. My main problem is with quotes within a field. In the following example, lines 1-4 work correctly but lines 5, 6 and 7 don't.

COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,

I'd like to avoid Text::CSV as it isn't installed on the target server. Realising that CSVs are more complicated than they look, I'm using a recipe from the Perl Cookbook.

sub parse_csv {
  my $text = shift; # record containing comma-separated values
  my @columns = ();
  push(@columns ,$+) while $text =~ m{
    # The first part groups the phrase inside quotes
    "([^\"\\]*(?:\\.[^\"\\]*)*)",?
      | ([^,]+),?
      | ,
    }gx;
  push(@columns ,undef) if substr($text, -1,1) eq ',';
  return @columns ; # list of vars that was comma separated.
}

Does anyone have a suggestion for improving the regex to handle the above cases?

Please, try using CPAN

There's no reason you couldn't download a copy of Text::CSV, or any other non-XS implementation of a CSV parser, and install it in your local directory, or in a lib/ subdirectory of your project so that it's installed along with your project's rollout.

If you can't store text files in your project, then I'm wondering how it is you are coding your project.

http://novosial.org/perl/life-with-cpan/non-root/

It should be a good guide on how to get these into a working state locally.

Not using CPAN is really a recipe for disaster.

Please consider this before trying to write your own CSV implementation.

Text::CSV is over a hundred lines of code, including fixed bugs and edge cases, and rewriting this from scratch will just make you learn the hard way how awful CSV can be.

Note: I learnt this the hard way. It took me a full day to get a working CSV parser in PHP before I discovered that a built-in one had been added in a later version. It really is something awful.

You can parse CSV using Text::ParseWords, which ships with Perl.

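use feature 'say';   # say() requires Perl 5.10+ and must be enabled
use Text::ParseWords;

while (<DATA>) {
    chomp;
    # keep = 0: strip the quotes that delimit quoted fields
    my @f = quotewords ',', 0, $_;
    say join ":" => @f;
}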

__DATA__
COLLOQ_TYPE,COLLOQ_NAME,COLLOQ_CODE,XDATA
S,"BELT,FAN",003541547,
S,"BELT V,FAN",000324244,
S,SHROUD SPRING SCREW,000868265,
S,"D" REL VALVE ASSY,000771881,
S,"YBELT,"V"",000323030,
S,"YBELT,'V'",000322933,

which parses your CSV correctly:

# => COLLOQ_TYPE:COLLOQ_NAME:COLLOQ_CODE:XDATA
# => S:BELT,FAN:003541547:
# => S:BELT V,FAN:000324244:
# => S:SHROUD SPRING SCREW:000868265:
# => S:D REL VALVE ASSY:000771881:
# => S:YBELT,V:000323030:
# => S:YBELT,'V':000322933:

The only issue I've had with Text::ParseWords is when nested quotes in the data aren't escaped correctly. However, this is badly built CSV data and would cause problems with most CSV parsers ;-)

So you may notice that

# S,"YBELT,"V"",000323030,

came out as (i.e. the quotes around "V" were dropped)

# S:YBELT,V:000323030:

however, if it's escaped like so

# S,"YBELT,\"V\"",000323030,

then the quotes will be retained

# S:YBELT,"V":000323030:
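The difference between the two inputs can be checked side by side. The following is a minimal sketch using only the core Text::ParseWords module; the helper name `parse_fields` is made up for illustration, and the trailing empty field is normalised to an empty string so that `join` stays quiet under `use warnings`.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper: split one line on commas with keep = 0, so the
# delimiting quotes are stripped and backslash escapes are resolved.
sub parse_fields {
    my ($line) = @_;
    # A trailing comma can produce an undefined last field; map it to ''.
    return map { defined $_ ? $_ : '' } quotewords( ',', 0, $line );
}

my $unescaped = 'S,"YBELT,"V"",000323030,';
my $escaped   = 'S,"YBELT,\"V\"",000323030,';   # single quotes keep the backslashes

print join( ':', parse_fields($unescaped) ), "\n";   # S:YBELT,V:000323030:
print join( ':', parse_fields($escaped) ),   "\n";   # S:YBELT,"V":000323030:
```

Note that this relies on backslash escaping, which is the shell-like convention Text::ParseWords follows; many CSV producers double the quote (`""`) instead.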

This works like a charm.

The line is assumed to be comma-separated, with embedded commas:

my @columns = Text::ParseWords::parse_line(',', 0, $line);

Tested and working:

$_.=','; # fake an ending delimiter

while($_=~/"((?:""|[^"])*)",|([^,]*),/g) {
  $cell=defined($1) ? $1:$2; $cell=~s/""/"/g; 
  print "$cell\n";
}

# The regexp strategy is as follows:
# First - we attempt a match on any quoted part starting the CSV line:-
#  "((?:""|[^"])*)",
# It must start with a quote, and end with a quote followed by a comma, and is allowed to contain either doubled quotes - "" - or anything except a single quote character [^"] - this goes into $1
# If we can't match that, we accept anything up to the next comma instead, & put it into $2
# Lastly, we convert "" to " and print out the cell.
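For reuse, the same loop can be wrapped in a function that returns the cells instead of printing them. This is only a sketch: the name `split_csv_line` is hypothetical, and it assumes that quotes inside a quoted field are escaped by doubling them (`""`), which is not how lines 6 and 7 of the question's data are written.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex above. It splits one CSV line whose
# quoted fields escape embedded quotes by doubling them ("").
sub split_csv_line {
    my ($line) = @_;
    $line .= ',';    # fake an ending delimiter, as in the original loop
    my @cells;
    while ( $line =~ /"((?:""|[^"])*)",|([^,]*),/g ) {
        my $cell = defined $1 ? $1 : $2;   # quoted branch or plain branch
        $cell =~ s/""/"/g;                 # collapse doubled quotes
        push @cells, $cell;
    }
    return @cells;
}

for my $line ( 'S,"BELT,FAN",003541547,', 'S,"YBELT,""V""",000323030,' ) {
    print join( '|', split_csv_line($line) ), "\n";
}
# prints:
# S|BELT,FAN|003541547|
# S|YBELT,"V"|000323030|
```

A field such as `"D" REL VALVE ASSY` fails the quoted branch (the closing quote is not followed by a comma), so it falls through to the plain branch and keeps its quotes.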

Be warned that CSV files can contain cells with embedded newlines inside the quotes, so you'll need to do this if you are reading the data in one line at a time:

if("$pre$_"=~/,"[^,]*\z/) {
  $pre.=$_; next;
}
$_="$pre$_";
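A different way to detect an unfinished record, swapped in here for the regex test above, is to count quote characters: while the count is odd, a quoted field is still open. The sketch below assumes doubled-quote escaping (so every complete field contributes an even number of quotes) and uses a made-up helper name, `read_csv_records`; it reads from an in-memory filehandle purely for demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns complete CSV records from a filehandle,
# joining physical lines while a quoted field is still open.
sub read_csv_records {
    my ($fh) = @_;
    my @records;
    my $pending = '';
    while ( my $line = <$fh> ) {
        $pending .= $line;
        # tr/"// counts the quote characters without modifying the string;
        # an odd count means we are still inside a quoted field.
        my $quotes = ( $pending =~ tr/"// );
        next if $quotes % 2;
        chomp $pending;
        push @records, $pending;
        $pending = '';
    }
    push @records, $pending if length $pending;   # unterminated final record
    return @records;
}

# Demonstration data: the first record spans two physical lines.
my $data = qq{S,"BELT,\nFAN",003541547,\nS,SHROUD SPRING SCREW,000868265,\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @records = read_csv_records($fh);
close $fh;

print scalar(@records), " records\n";   # 2 records
```

Each returned record can then be handed to whichever field splitter is in use; the embedded newline is preserved inside the first record.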

Tested:


use Test::More tests => 2;

use strict;

sub splitCommaNotQuote {
    my ( $line ) = @_;

    my @fields = ();

    while ( $line =~ m/((\")([^\"]*)\"|[^,]*)(,|$)/g ) {
        if ( $2 ) {
            push( @fields, $3 );
        } else {
            push( @fields, $1 );
        }
        last if ( ! $4 );
    }

    return( @fields );
}

is_deeply(
    +[splitCommaNotQuote('S,"D" REL VALVE ASSY,000771881,')],
    +['S', '"D" REL VALVE ASSY', '000771881', ''],
    "Quote in value"
);
is_deeply(
    +[splitCommaNotQuote('S,"BELT V,FAN",000324244,')],
    +['S', 'BELT V,FAN', '000324244', ''],
    "Strip quotes from entire value"
);

Finding matching pairs using regexes is a non-trivial and, in general, unsolvable task. There are plenty of examples in Jeffrey Friedl's book Mastering Regular Expressions. I don't have it at hand now, but I remember that he used CSV for some examples, too.

You can (try to) use CPAN.pm to simply have your program install/update Text::CSV. As said before, you can even "install" it to a home or local directory, and add that directory to @INC (or, if you prefer not to use BEGIN blocks, you can write use lib 'dir'; instead, which is probably better).
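As a rough sketch of the local-directory approach: the snippet below assumes the module files were copied into a `lib` directory next to the script (that layout is an assumption, not something the question states) and loads Text::CSV only if it can actually be found, so the script still runs on a server where it is missing.

```perl
use strict;
use warnings;
use FindBin qw($Bin);

# Assumption: pure-Perl modules live in a "lib" directory beside this script.
# `use lib` prepends that directory to @INC at compile time.
use lib "$Bin/lib";

# Try to load Text::CSV at run time instead of failing at compile time.
my $have_text_csv = eval { require Text::CSV; 1 } ? 1 : 0;

if ($have_text_csv) {
    my $csv = Text::CSV->new( { binary => 1 } )
        or die "Cannot construct Text::CSV object";
    if ( $csv->parse('S,"BELT,FAN",003541547,') ) {
        print join( ':', $csv->fields ), "\n";
    }
}
else {
    print "Text::CSV not found; searched: @INC\n";
}
```

The `eval { require ...; 1 }` guard is a design choice for the demonstration; a script that cannot work without the module would normally just `use Text::CSV;` and let the failure surface immediately.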
