
Remove carriage returns from CSV data value

I am importing data from a pipe-delimited CSV file into MySQL using a LOAD DATA INFILE statement, with lines terminated by '\r\n'. My problem is that some of the data values within a row also contain '\r\n', which causes the load to fail. I have similar files that use only '\n' within data values to indicate line breaks, and those load without issues.

Example GOOD CSV

School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New
Jersey
|USA\r

Example BAD CSV

School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New\r
Jersey\r
|USA\r

Is there a way to pre-process the CSV, using sed, awk, or perl, to clean up the extra carriage returns inside the column values?

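This is one possible solution in perl. It reads a line from standard input and, if that line has fewer than 4 fields, keeps reading the next line and merging it in until the record does have 4 fields. The merged pieces are joined with a bare "\n", and each finished record is written to standard output terminated by "\r\n". Just change the value of $number_of_fields to the right number.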

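#!/usr/bin/perl

use strict;
use warnings;

my $number_of_fields = 4;

while (my $line = <STDIN>) {
    $line =~ s/[\r\n]//g;                        # strip the line terminator
    my @fields = split(/\|/, $line, -1);         # limit -1 keeps trailing empty fields
    next if @fields == 0;                        # skip empty lines

    # Too few fields: the record was split by an embedded line break,
    # so keep merging the following lines into it.
    while (@fields < $number_of_fields) {
        defined(my $nextline = <STDIN>) or last; # stop at end of input
        $nextline =~ s/[\r\n]//g;
        my @tmpfields = split(/\|/, $nextline, -1);
        next if @tmpfields == 0;
        # The first piece continues the last field; join with a bare "\n".
        $fields[-1] .= "\n" . shift(@tmpfields);
        push @fields, @tmpfields;
    }
    print join("|", @fields), "\r\n";
}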
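
For systems that only have a POSIX awk, the same merge-until-enough-fields idea can be sketched as below. This is a minimal sketch and not part of the original answer: the function name fix_crlf_fields and the variable names are made up for illustration. Like the Perl script, it assumes that the embedded line break never falls inside the last field and that a literal | never appears inside a value.

```shell
#!/bin/sh
# Hypothetical helper: merge physical lines until a record has the expected
# number of fields. The field count is passed as the first argument.
fix_crlf_fields() {
    awk -v want="$1" '
        {
            sub(/\r$/, "")                        # drop the CR ending this physical line
            if (!have && $0 == "") next           # skip empty lines between records
            rec = (have ? rec "\n" : "") $0       # join continuation lines with a bare LF
            have = 1
            tmp = rec
            # gsub on a copy returns the number of pipes, i.e. fields minus 1
            if (gsub(/[|]/, "", tmp) >= want - 1) {
                printf "%s\r\n", rec              # complete record: terminate with CRLF
                rec = ""; have = 0
            }
        }
        END { if (have) printf "%s\r\n", rec }    # flush an incomplete trailing record
    '
}
```

It could be used as fix_crlf_fields 4 < bad.csv > good.csv, where bad.csv and good.csv are placeholder file names.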

With GNU awk for multi-char RS and RT:

$ awk -v RS='([^|]+[|]){3}[^|]+\r\n' -v ORS= '{$0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n")} 1' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M

Note that it assumes the number of fields is 4, so if you have a different number of fields, change 3 to that number minus 1. It also assumes that no field is empty, since [^|]+ needs at least one character per field. The script could instead calculate the number of fields from the first line of your input, provided that first line cannot have the problem:

$ awk '
    BEGIN { RS="\r\n"; ORS=""; FS="|" }
    FNR==1 { RS="([^|]+[|]){"NF-1"}[^|]+\r\n"; RT=$0 RT }
    { $0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n"); print }
' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M
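
Before running LOAD DATA INFILE on the cleaned file, it may be worth checking that every '\r\n'-terminated record really has the expected number of fields. The following is a rough sketch in portable awk, not something from the answers above; the function name count_bad_records is hypothetical, and it assumes a literal | never appears inside a value.

```shell
#!/bin/sh
# Hypothetical helper: print how many CRLF-terminated records have a field
# count different from the expected one given as the first argument.
count_bad_records() {
    awk -v want="$1" '
        {
            rec = rec $0                          # accumulate the physical lines of one record
            if ($0 ~ /\r$/) {                     # a trailing CR marks the real end of a record
                n = gsub(/[|]/, "", rec) + 1      # fields = pipes + 1
                if (n != want) bad++
                rec = ""
            }
        }
        END {
            if (rec != "") bad++                  # leftover text with no CRLF terminator
            print bad + 0
        }
    '
}
```

For the GOOD example in the question, count_bad_records 4 < file prints 0; for the BAD example it prints 3, one for each of the three pieces of the broken Princeton row.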
