I have a Perl script that accepts a comma-separated CSV file as input. I would like to discard the last column (the column number is known in advance).
The problem is that the last column may contain quoted strings with commas, in which case I would like to remove the entire quoted string.
Example:
colA,colB,colC
1,2,3
4,5,"6,6"
What I would like to end up with is:
colA,colB
1,2
4,5
The current solution I have uses the Linux cut command in the following manner:
cat $file | cut -d ',' -f 3 --complement
Which outputs the following:
colA,colB
1,2
4,5,6"
This works well unless the last column is a quoted string with commas in it, in which case part of the string is left behind.
I can only use native Perl/Linux commands to solve this.
Appreciate your help
Using Text::CSV, as a script that processes STDIN into STDOUT:
use strict;
use warnings;
use Text::CSV 'csv';
my $csv = csv(in => \*STDIN, keep_headers => \my @headers,
auto_diag => 2, encoding => 'UTF-8');
pop @headers;
csv(in => $csv, out => \*STDOUT, headers => \@headers,
auto_diag => 2, encoding => 'UTF-8');
The obvious benefit of this approach is handling all common edge cases automatically.
Try this, based on an awk regex (FPAT requires GNU awk):
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}' "$file"
Example
echo '"4,4",5,"6,6"' | awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
"4,4",5
If quoted strings containing commas are the only trouble you are facing, you can use this:
$ sed -E 's/,"[^"]*"$|,[^,]*$//' ip.txt
colA,colB
1,2
4,5
,"[^"]*"$ will match , followed by " followed by non-" characters followed by " at the end of the line.
,[^,]*$ will match , followed by non-, characters at the end of the line.
The double-quoted column will match earlier in the string and thus gets deleted completely.
The equivalent for Perl would be:
perl -lpe 's/,"[^"]*"$|,[^,]*$//' ip.txt
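To see how the two alternatives behave on the sample rows, here is a minimal sketch that wraps the sed command in a shell function. The name drop_last_col is hypothetical and only used for illustration. The sketch assumes, like the regex itself, that a quoted last column contains no escaped quotes ("") and that an unquoted last column contains no commas.

```shell
#!/bin/sh
# drop_last_col is a hypothetical helper name, not part of the original answer.
# It reads CSV rows on stdin and removes the last column, whether it is
# a quoted string (first alternative) or a plain value (second alternative).
drop_last_col() {
    sed -E 's/,"[^"]*"$|,[^,]*$//'
}

# Feed the sample rows from the question through the function.
printf '%s\n' 'colA,colB,colC' '1,2,3' '4,5,"6,6"' | drop_last_col
```

Because the pattern is anchored at the end of the line, quoted fields in earlier columns are left untouched.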
I believe sungtm's answer is correct and requires some explanation:
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
Is equivalent to:
BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"; # GNU awk specific: FPAT is a regex describing a field's content, not the separator
# [^,]+ ------ matches a run of characters other than ","
# \"[^\"]+\" ------ matches a quoted string, including the quotes
# ()|() ------ alternation: a field is whichever of the two groups matches
OFS = ","; # Output field separator
}
{ # for each input line/record
print $1, $2; # print "1st field" OFS value "2nd field"
}
Save the script as script.awk and run it with:
awk -f script.awk input.txt
Save the script in any file, say script.pl, and execute it as:
prompt>perl script.pl /opt/filename.csv
It handles the cases above. Note that it rewrites the input file in place.
use strict;
use warnings;

if (@ARGV != 1) {
    print STDERR "usage: perl script.pl absolute_file_path\n";
    exit 1;
}
my $filename = $ARGV[0];    # complete file path here

# Read the whole file first, because it is rewritten in place below.
open(my $in, '<', $filename)
    or die "Could not open file '$filename' for reading: $!";
my @lines = <$in>;
close($in);

open(my $fo, '>', $filename)
    or die "Could not open file '$filename' for writing: $!";
foreach my $line (@lines) {
    chomp($line);
    # Capture each field: a quoted string ("" is an escaped quote) or a run of non-commas.
    my @fields = $line =~ /(?:^|,)("(?:[^"]|"")*"|[^,]*)/g;
    pop(@fields);    # discard the last column
    print $fo join(',', @fields), "\n";
}
close($fo);
This case is quite interesting. Please see my solution below.
You can set $debug = 1 to see what happens and how this mechanism works.
use strict;
use warnings;
my $debug = 0;
while( <DATA> ) {
print "IN: $_" if $debug;
chomp;
s/"(.+?)"/replace($1)/ge; # do magic replacement , -> ___ in block of interest
print "REP: $_\n" if $debug;
my @data = split /,/; # split into array
pop @data; # pop last element of array
my $line = join ',', @data; # merge array into a string
$line =~ s/___/,/g; # do unmagic replacement
$line =~ s/\|/"/g; # restore | -> "
printf "%s%s\n", $debug ? "OUT: " : '', $line; # print result ($line is passed as an argument so a literal % in the data is safe)
}
sub replace {
my $line = shift;
$line =~ s/,/___/g; # do magic replacement in our block
return "|$line|"; # put | around block of interest
}
__DATA__
colA,colB,colC
1,2,3
4,5,"6,6"
8,3,"1,2",37,82
64,12,"1,2,3,4",42,56
"3,4,7,8",2,8,"8,7,6,5,4",2,8
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1"
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1",3,4
Appreciate your help. Below is the solution I ended up using:
cat file.csv | perl -MText::ParseWords -nle '@f = parse_line(",",2, $_); tr/,/$/d for @f; print join ",", @f' | cut -d ',' -f 3 --complement | tr $ , ;
This replaces the commas inside quoted fields with the $ sign, which is replaced back with a comma after the unwanted last column is discarded. It assumes the data contains no literal $ characters.