I have a very large (1.5 GB) malformed CSV file I need to read into R, and while the file itself is a CSV, the delimiters break after a certain number of lines due to poorly-placed line returns.
I have a reduced example attached, but a truncated visual representation of it looks like this:
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000
-0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000]
[ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000
-0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000
-0.00000000 0.00000000 0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111
1.11111111 1.11111111 1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222
-2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222]
[-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222
2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222]
[-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222
-2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222]
[-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222
-2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ]
[-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222
-2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ]
[ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222
-2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000
-0.00000000 0.00000000 0.00000000]]"
The line breaks all appear as \n's in the CSV.
To avoid loading it all into memory and attempting to parse it as a dataframe in another environment, I have been trying to print relevant snippets from the CSV to the terminal with the line returns removed, runs of spaces collapsed, and commas inserted between the values.
Like the following:
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
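The per-field cleanup being described (strip line returns, collapse whitespace, comma-separate the numbers, no commas touching the brackets) can be sketched in Python; `clean_field` is a hypothetical helper, not part of any pipeline above:

```python
import re

def clean_field(value: str) -> str:
    """Collapse newlines/spaces in a quoted matrix field and
    comma-separate the numbers, e.g.
    '[[ 1.0 2.0\n3.0]]' -> '[[1.0,2.0,3.0]]'."""
    s = re.sub(r"\s+", ",", value.strip())        # newlines + space runs -> one comma
    s = s.replace("[,", "[").replace(",]", "]")   # drop commas adjacent to brackets
    return s

print(clean_field("[[ -0.1 0.2 -0.3\n0.4 0.5]]"))  # -> [[-0.1,0.2,-0.3,0.4,0.5]]
```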
My main attempt pulls everything on a line between the quoted double brackets with:
awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}' Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','
outputting:
000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"
This gets close, but it also turns the spaces inside the text variables into commas (FIRST,TEXT,FOR,ZERO) and inserts commas just inside the "[[ and ]]" delimiters, which I don't need it to do, and those stray commas are awkward to clean up with tr because of the necessary escape characters.

I didn't understand your goal. The CSV file seems to me to be a correct CSV file (line breaks inside a double-quoted field are valid CSV). If you just want to remove the line breaks, you can use Miller and the clean-whitespace verb:
mlr --csv clean-whitespace Malformed.csv >Malformed_c.csv
Assumptions:

the only field containing double quotes and embedded linefeeds is broken_column_var
broken_column_var is the last field of each record
broken_column_var values contain at least one embedded linefeed (ie, each broken_column_var value spans at least 2 physical lines); otherwise we need to add some code to address both double quotes residing on the same line... doable, but skipped for now so as to not (further) complicate the proposed code

One (verbose) awk approach to removing the embedded linefeeds from broken_column_var while also replacing spaces with commas:
awk '
NR==1 { print; next } # print header
!in_merge && /["]/ { split($0,a,"\"") # 1st double quote found; split line on double quote
head = a[1] # save 1st part of line
data = "\"" a[2] # save double quote and 2nd part of line
in_merge = 1 # set flag
next
}
in_merge { data = data " " $0 # append current line to "data"
if ( $0 ~ /["]/ ) { # if 2nd double quote found => process "data"
gsub(/[ ]+/,",",data) # replace consecutive spaces with single comma
gsub(/,[]]/,"]",data) # replace ",]" with "]"
gsub(/[[],/,"[",data) # replace "[," with "["
print head data # print new line
in_merge = 0 # clear flag
}
}
' Malformed.csv
This generates:
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222,2.22222222,-2.22222222,-2.22222222,2.2222222,-2.22222222,-2.22222222,-2.22222222,-2.22222222,2.22222222,2.22222222,2.22222222],[-2.22222222,-2.22222222,2.22222222,2.2222222,2.22222222,-2.22222222,2.2222222,-2.2222222,2.22222222,2.2222222,2.222222,-2.22222222],[-2.22222222,-2.2222222,2.22222222,2.2222222,2.22222222,-2.22222222,-2.22222222,-2.2222222,-2.22222222,2.22222222,2.2222222,2.22222222],[-2.22222222,-2.22222222,2.2222222,2.2222222,2.2222222,-2.22222222,-2.222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,2.2222222],[-2.22222222,-2.222222,2.22222222,2.22222222,2.22222222,-2.2222222,-2.2222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,-2.222222],[2.22222222,-2.22222222,-2.222222,-2.222222,-2.2222222,-2.22222222,-2.222222,-2.22222222,2.2222222,-2.2222222,2.2222222,2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[-0.00000000,0.00000000,-0.00000000,0.000000,-0.00000000,-0.00000000,0.00000000,0.00000000]]"
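The same three substitutions the awk script performs (whitespace runs to a comma, then stripping the commas that land next to brackets) can be sketched with Python's csv module, which reassembles the multi-line quoted field automatically; `clean_csv` is an illustrative helper operating on a string for clarity:

```python
import csv
import io
import re

def clean_csv(text: str) -> str:
    """Reassemble multi-line quoted fields via the csv module, then
    comma-separate the numbers in the last column."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))                    # header unchanged
    for row in reader:
        cell = re.sub(r"\s+", ",", row[-1].strip())  # whitespace runs -> comma
        cell = cell.replace("[,", "[").replace(",]", "]")  # no commas at brackets
        row[-1] = cell
        writer.writerow(row)                         # re-quotes the field as needed
    return out.getvalue()

sample = 'a,b\n1,"[[ 1.0 2.0\n3.0 4.0]]"\n'
print(clean_csv(sample))
```

Because the cleaned field still contains commas, csv.writer re-quotes it on output, matching the format shown above.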
Use double quote as the field separator. A complete record has 1 or 3 fields.
awk '
BEGIN {FS = OFS = "\""}
{$0 = prev $0; $1=$1}
NF % 2 == 1 {print; prev = ""; next}
{prev = $0}
END {if (prev) print prev}
' file.csv
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000] [ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000 -0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111 1.11111111 1.11111111 1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222 -2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222] [-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222 2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222] [-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222 -2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222] [-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222 -2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ] [-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222 -2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ] [ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222 -2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000 -0.00000000 0.00000000 0.00000000]]"
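The quote-parity idea behind this awk script (a physical line belongs to the previous record until the accumulated double-quote count is even) can be sketched in Python without any CSV library; `merge_records` is an illustrative name, and here each embedded newline is replaced with a space:

```python
def merge_records(lines):
    """Join physical lines into logical CSV records: keep appending
    lines while the accumulated record has an odd number of quotes."""
    buf = ""
    for line in lines:
        buf += line
        if buf.count('"') % 2 == 0:   # balanced quotes => record complete
            yield buf
            buf = ""
        else:
            buf += " "                # embedded newline becomes a space
    if buf:
        yield buf                     # trailing partial record, if any

lines = ['h1,h2', '1,"[[ 1.0', '2.0]]"', '2,"[[ 3.0]]"']
print(list(merge_records(lines)))
```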
For a language with a CSV library, I've found Perl's Text::CSV useful for quoted newlines:
perl -e '
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline($fh)) {
    $row->[-1] =~ s/\n//g;
    $csv->say(STDOUT, $row);
}
'
This might work for you (GNU sed):
sed -E '1b
:a;N;/"$/!ba
s/"/\n&/
h
s/\n/ /2g
s/.*\n//
s/ +/,/g
s/,\]/]/g
s/\[,/[/g
H
g
s/\n.*\n//' file
Forget the header line.
Gather up each record.
Introduce a newline before the last field.
Make a copy of the ameliorated record.
Replace every newline after the first with a space.
Remove everything up to and including the first introduced newline.
Replace spaces by commas.
Remove any introduced commas after or before square brackets.
Append the last field to the copy.
Make the copy current.
Remove everything between (and including) the introduced newlines.
NB This expects only the last field of each record to be double quoted.
Alternative:
sed -E '1b;:a;N;/"$/!ba;y/\n/ /;:b;s/("\S+) +/\1,/;tb;s/,\[/[/g;s/\],/]/g' file
You can use GoCSV's replace command to easily strip out newlines:
gocsv replace \
-c broken_column_var \
-regex '\s+' \
-repl ' ' \
input.csv
That normalizes all contiguous whitespace (\s+) to a single space.
A very small Python script can also handle this:
import csv
import re

ws_re = re.compile(r"\s+")

f_in = open("input.csv", newline="")
reader = csv.reader(f_in)
f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)

writer.writerow(next(reader))  # transfer header

for row in reader:
    row[5] = ws_re.sub(" ", row[5])  # normalize whitespace in broken_column_var
    writer.writerow(row)
The quoted embedded newlines are handled by the csv module.

Pure bash, from How to parse a CSV file in Bash?, with a little modification checking for double-quote parity.
#!/bin/bash
enable -f /usr/lib/bash/csv csv
exec {FD}< "$1"
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\\n' "${headline[@]}"
numcols=${#headline[@]}
while read -ru $FD line; do
    while chk=${line//[^\"]}
          csv -a row -- "$line"
          [[ -n ${chk//\"\"} ]] || (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break 2
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\\n" "${row[@]}"
done
exec {FD}>&-
With your broken_input.csv, this shows:
SubID : "000000000"
Date1 : "0000-00-00"
date2 : "0000-00-00"
var1 : "0"
var2 : "FIRST\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000\n-0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000\n0.00000000 0.00000000 0.00000000]\n[ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000\n-0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000\n-0.00000000 0.00000000 0.0000000 ]]'"
SubID : "000000000"
Date1 : "1111-11-11"
date2 : "1111-11-11"
var1 : "1"
var2 : "SECOND\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111\n1.11111111 1.11111111 1.11111111]]'"
SubID : "000000000"
Date1 : "2222-22-22"
date2 : "2222-22-22"
var1 : "2"
var2 : "THIRD\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222\n-2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222]\n[-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222\n2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222]\n[-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222\n-2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222]\n[-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222\n-2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ]\n[-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222\n-2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ]\n[ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222\n-2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]'"
SubID : "111111111"
Date1 : "0000-00-00"
date2 : "0000-00-00"
var1 : "00"
var2 : "FIRST\ TEXT\ FOR\ ONE"
broken_column_var: "$'[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000\n-0.00000000 0.00000000 0.00000000]]'"
Your CSV is not as broken as it seems!
Note that using bash to process a huge (1.5 GB) file is not recommended! You will get better results using python, or C with the appropriate libraries.