I am importing data from a pipe-delimited CSV into MySQL using a LOAD DATA INFILE statement, with lines terminated by '\r\n'. My problem is that some of the column values within a row contain '\r\n' themselves, which causes the load to fail. I have similar files that use just '\n' within the data to indicate line breaks, and those cause no issues.
Example GOOD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New
Jersey
|USA\r
Example BAD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New\r
Jersey\r
|USA\r
Is there a way to pre-process the CSV, using sed, awk, or perl, to clean up the extra carriage return in the column values?
This is one possible solution in Perl. It reads a line and, if that line has fewer than 4 fields, keeps reading the following lines and merging them in until it does have 4 fields. Just change the value of $number_of_fields to the right number.
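#!/usr/bin/perl
use strict;
use warnings;

my $number_of_fields = 4;

while (<STDIN>) {
    s/[\r\n]//g;                          # strip line terminators
    my @fields = split(/\|/, $_, -1);     # -1 keeps trailing empty fields
    next if $#fields == -1;               # skip blank lines
    # Keep merging the following lines until the record is complete.
    while ($#fields < $number_of_fields - 1) {
        defined(my $nextline = <STDIN>) or last;   # stop at end of input
        $nextline =~ s/[\r\n]//g;
        my @tmpfields = split(/\|/, $nextline, -1);
        next if $#tmpfields == -1;
        # The first piece continues the last field; the rest are new fields.
        $fields[$#fields] .= "\n" . shift(@tmpfields);
        push @fields, @tmpfields;
    }
    print join("|", @fields), "\r\n";
}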
With GNU awk for multi-char RS and RT:
$ awk -v RS='([^|]+[|]){3}[^|]+\r\n' -v ORS= '{$0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n")} 1' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M
Note that this assumes the number of fields is 4, so if you have a different number of fields, change 3 to that number minus 1. The script could instead calculate the number of fields by reading the first line of your input, provided that first line cannot have your problem:
$ awk '
BEGIN { RS="\r\n"; ORS=""; FS="|" }
FNR==1 { RS="([^|]+[|]){"NF-1"}[^|]+\r\n"; RT=$0 RT }
{ $0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n"); print }
' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M