Remove last column from a csv file in Perl

I have a Perl script that accepts a comma separated csv file as input. I would like to discard the last column (the column number is known in advance).

The problem is that the last column may contain quoted strings with commas, in which case I would like to cut the entire string.

Example:

colA,colB,colC
1,2,3
4,5,"6,6"

What I would like to end up with is:

colA,colB
1,2
4,5

My current solution uses the Linux cut command as follows:

cat $file | cut -d ',' -f 3 --complement

Which outputs the following:

colA,colB
1,2
4,5,6"

Which works great unless the last column is a quoted string with commas in it.

I can only use native Perl/Linux commands to solve this.

Appreciate your help

Using Text::CSV, as a script that processes STDIN into STDOUT:

use strict;
use warnings;
use Text::CSV 'csv';

my $csv = csv(in => \*STDIN, keep_headers => \my @headers,
  auto_diag => 2, encoding => 'UTF-8');

pop @headers;

csv(in => $csv, out => \*STDOUT, headers => \@headers,
  auto_diag => 2, encoding => 'UTF-8');

The obvious benefit of this approach is that Text::CSV handles the common edge cases (quoted fields, embedded commas, escaped quotes) automatically.

Try this GNU awk solution, which uses an FPAT regex to describe what a field looks like:

awk -v FPAT='([^,]+)|(\"[^\"]+\")'  -v OFS=',' '{print $1,$2}' ${file}

Example

echo '"4,4",5,"6,6"' | awk -v FPAT='([^,]+)|(\"[^\"]+\")'  -v OFS=',' '{print $1,$2}'
"4,4",5

Reference

If quoted strings containing commas are the only problem you are facing, you can use this:

$ sed -E 's/,"[^"]*"$|,[^,]*$//' ip.txt
colA,colB
1,2
4,5
  • ,"[^"]*"$ will match , followed by " followed by non " characters followed by " at the end of line
  • ,[^,]*$ will match , followed by non , characters at end of line

When the last column is double quoted, the first alternative matches at an earlier position in the line than the second one can, so the whole quoted column is deleted.

The Perl equivalent would be perl -lpe 's/,"[^"]*"$|,[^,]*$//' ip.txt
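As a quick check of the substitution above, here is a minimal sketch that feeds the question's sample rows through the same regex. The function name drop_last_col is made up for illustration, and the sketch assumes the last field is either unquoted or a quoted string with no embedded quote characters.

```shell
# drop_last_col is a hypothetical helper name, not part of sed or any other tool.
# Assumption: the last field is unquoted, or quoted without embedded quotes.
drop_last_col() {
    sed -E 's/,"[^"]*"$|,[^,]*$//'
}

# Sample rows taken from the question.
result=$(printf '%s\n' 'colA,colB,colC' '1,2,3' '4,5,"6,6"' | drop_last_col)
printf '%s\n' "$result"
```

Because the pattern is anchored at the end of the line, it does not depend on how many columns precede the last one.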

I believe sungtm's answer is correct, but it requires some explanation:

awk -v FPAT='([^,]+)|(\"[^\"]+\")'  -v OFS=',' '{print $1,$2}'

Is equivalent to:

script.awk

BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"; # GNU awk specific: FPAT is a regex describing a field's content
    # [^,]+ ------ matches a run of characters other than ","
    # \"[^\"]+\" ------ matches a quoted string, including the quotes
    # ()|() ------ alternation: a field matches either group
    OFS = ","; # Output field separator
}
{ # for each input line/record
    print $1, $2; # print "1st field" OFS value "2nd field"
}

Running:

awk -f script.awk input.txt

Save the script below in any file, say script.pl, and run it as:

perl script.pl /opt/filename.csv

Note that the script overwrites the input file in place. It handles the following cases:

  • "1","2,3",4,"test, test" ==> "1","2,3",4
  • 1,"2,3,4","5, 6","7,8" ==> 1,"2,3,4","5, 6"
  • 0,0,0,"test" ==> 0,0,0

use strict;
use warnings;

if (@ARGV != 1) {
    print "usage: perl script.pl absolute_file_path\n";
    exit 1;
}
my $filename = $ARGV[0]; # complete file path here

# Read the whole file first, because it is rewritten in place below.
open(my $in, '<', $filename)
    or die "Could not open file '$filename': $!";
my @lines = <$in>;
close($in);

open(my $fo, '>', $filename)
    or die "Could not write to file '$filename': $!";

foreach my $line (@lines) {
    chomp($line);

    # Capture each field: either a quoted string (with "" as an escaped quote)
    # or a run of non-comma characters. Empty fields are kept.
    my @fields = $line =~ /(?:^|,)("(?:[^"]|"")*"|[^,]*)/g;
    pop(@fields);                          # drop the last column
    print $fo join(',', @fields), "\n";    # every output line ends with a newline
}
close($fo);

Well, this case is quite interesting. Please see my solution below.

You can set $debug = 1; to see what happens and how this mechanism works.

use strict;
use warnings;

my $debug = 0;

while( <DATA> ) {
    print "IN:  $_" if $debug;
    chomp;

    s/"(.+?)"/replace($1)/ge;   # do magic replacement , -> ___ in block of interest

    print "REP: $_\n" if $debug;

    my @data = split /,/;       # split into array
    pop @data;                  # pop last element of array

    my $line = join ',', @data; # merge array into a string

    $line =~ s/___/,/g;         # do unmagic replacement
    $line =~ s/\|/"/g;          # restore | -> "

    printf "%s%s\n", $debug ? "OUT: " : '', $line;  # print result
}

sub replace {
    my $line = shift;

    $line =~ s/,/___/g;         # do magic replacement in our block

    return "|$line|";           # put | around block of interest
}

__DATA__
colA,colB,colC
1,2,3
4,5,"6,6"
8,3,"1,2",37,82
64,12,"1,2,3,4",42,56
"3,4,7,8",2,8,"8,7,6,5,4",2,8
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1"
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1",3,4

Appreciate your help. Below is the solution I ended up using:

cat file.csv | perl -MText::ParseWords -nle '@f = parse_line(",",2, $_); tr/,/$/d for @f; print join ",", @f' | cut -d ',' -f 3 --complement | tr $ , ;

This replaces the commas inside quoted fields with the $ sign, which is turned back into a comma after the unwanted last column has been discarded. Note that it assumes the data itself contains no $ characters.
