Parsing a large CSV file with unusual characters, spacing, brackets, and irregular returns in bash
I have a very large (1.5 GB) malformed CSV file that I need to read into R. While the file itself is a CSV, the delimiters break after a certain number of lines due to poorly placed line returns.
I have a reduced example attached, but a truncated visual representation of it looks like this:
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000
-0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000
0.00000000 0.00000000 0.00000000]
[ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000
-0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000
-0.00000000 0.00000000 0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111
1.11111111 1.11111111 1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222
-2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222]
[-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222
2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222]
[-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222
-2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222]
[-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222
-2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ]
[-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222
-2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ]
[ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222
-2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000
-0.00000000 0.00000000 0.00000000]]"
The newlines all appear as \n's in the CSV.
To get around loading it all into memory and attempting to parse it as a dataframe in other environments, I have been trying to print relevant snippets from the CSV to the terminal with the character returns removed, empty spaces collapsed, and commas inserted between the values.
Like the following:
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
My main attempt tries to pull all the information on a line between the quotes and double brackets with:
awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}' Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','
outputting:
000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"
Gets close, but:
- it stops after the first matching record (because of the exit);
- it replaces the spaces inside text fields with commas (FIRST,TEXT,FOR,ZERO);
- it inserts commas into the whitespace just after "[[ and just before ]]", which I don't need it to do;
- and the stray \r's are difficult for tr to remove due to the necessary escape characters.
I didn't understand your goal; the CSV file seems to me to be a correct CSV file. If you just want to remove the line breaks, you can use Miller and the clean-whitespace verb:
mlr --csv clean-whitespace Malformed.csv >Malformed_c.csv
to get this: https://gist.githubusercontent.com/aborruso/538e964c0c84a8b27d4c3d3b61d23bb4/raw/1fa83f43238be4a6aeb9c743aaf2e4da36f6cc74/Malformed_c.csv
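If Miller isn't installed, the effect of clean-whitespace can be approximated in a few lines of Python. This is only a rough sketch (the function name is mine): as this answer argues, the file is actually valid CSV, so a standard CSV reader already handles the quoted multi-line values, and we just collapse each field's internal whitespace.

```python
import csv
import io
import re

def clean_whitespace(text: str) -> str:
    """Collapse each field's internal whitespace runs (including embedded
    newlines) into single spaces, roughly what mlr clean-whitespace does."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([re.sub(r"\s+", " ", field).strip() for field in row])
    return out.getvalue()

print(clean_whitespace('a,b\n1,"[[ 0.1\n0.2 ]]"\n'))
```

For the real 1.5 GB file you would stream row by row from a file handle instead of holding the whole text in memory, but the per-field normalization is the same.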
Assumptions:
- the only field with embedded double quotes and linefeeds is the last one (broken_column_var);
- all broken_column_var values contain at least one embedded linefeed (ie, each broken_column_var value spans at least 2 physical lines); otherwise we need to add some code to address both double quotes residing on the same line... doable, but will skip for now so as to not (further) complicate the proposed code
One (verbose) awk approach to removing the embedded linefeeds from broken_column_var while also replacing spaces with commas:
awk '
NR==1 { print; next } # print header
!in_merge && /["]/ { split($0,a,"\"") # 1st double quote found; split line on double quote
head = a[1] # save 1st part of line
data = "\"" a[2] # save double quote and 2nd part of line
in_merge = 1 # set flag
next
}
in_merge { data = data " " $0 # append current line to "data"
if ( $0 ~ /["]/ ) { # if 2nd double quote found => process "data"
gsub(/[ ]+/,",",data) # replace consecutive spaces with single comma
gsub(/,[]]/,"]",data) # replace ",]" with "]"
gsub(/[[],/,"[",data) # replace "[," with "["
print head data # print new line
in_merge = 0 # clear flag
}
}
' Malformed.csv
This generates:
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222,2.22222222,-2.22222222,-2.22222222,2.2222222,-2.22222222,-2.22222222,-2.22222222,-2.22222222,2.22222222,2.22222222,2.22222222],[-2.22222222,-2.22222222,2.22222222,2.2222222,2.22222222,-2.22222222,2.2222222,-2.2222222,2.22222222,2.2222222,2.222222,-2.22222222],[-2.22222222,-2.2222222,2.22222222,2.2222222,2.22222222,-2.22222222,-2.22222222,-2.2222222,-2.22222222,2.22222222,2.2222222,2.22222222],[-2.22222222,-2.22222222,2.2222222,2.2222222,2.2222222,-2.22222222,-2.222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,2.2222222],[-2.22222222,-2.222222,2.22222222,2.22222222,2.22222222,-2.2222222,-2.2222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,-2.222222],[2.22222222,-2.22222222,-2.222222,-2.222222,-2.2222222,-2.22222222,-2.222222,-2.22222222,2.2222222,-2.2222222,2.2222222,2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[-0.00000000,0.00000000,-0.00000000,0.000000,-0.00000000,-0.00000000,0.00000000,0.00000000]]"
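For readers who prefer Python, the same two-state logic can be mirrored there; this is a comparison sketch only (the function name is mine), and like the awk it assumes each quoted value spans at least two physical lines:

```python
import re

def merge_broken_rows(lines):
    """State machine mirroring the awk above: buffer physical lines from the
    opening double quote to the closing one, then comma-separate the values."""
    out = [lines[0]]  # header passes through untouched
    head, data, in_merge = "", "", False
    for line in lines[1:]:
        if not in_merge and '"' in line:
            head, _, rest = line.partition('"')  # split on 1st double quote
            data, in_merge = '"' + rest, True    # keep the quote with the data
            continue
        if in_merge:
            data += " " + line                   # append current physical line
            if '"' in line:                      # closing quote: rewrite field
                data = re.sub(r" +", ",", data)  # spaces -> single comma
                data = data.replace(",]", "]").replace("[,", "[")
                out.append(head + data)
                in_merge = False
    return out

print(merge_broken_rows(['h', 'x,"[[ 1 2', '3 ]]"']))
```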
Use double quote as the field separator. A complete record has 1 or 3 fields.
awk '
BEGIN {FS = OFS = "\""}
{$0 = prev $0; $1=$1}
NF % 2 == 1 {print; prev = ""; next}
{prev = $0}
END {if (prev) print prev}
' file.csv
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000] [ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000 -0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000 0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111 1.11111111 1.11111111 1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222 -2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222] [-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222 2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222] [-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222 -2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222] [-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222 -2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ] [-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222 -2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ] [ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222 -2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000 -0.00000000 0.00000000 0.00000000]]"
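The parity idea can be stated directly: a physical line belongs to an unfinished record while the text gathered so far contains an odd number of double quotes. A minimal Python sketch of just that assembly step (joining continuation lines with a space, matching the output above; the function name is mine):

```python
def assemble_records(lines):
    """Glue physical lines together until every double quote in the
    accumulated record is closed (i.e. the quote count is even)."""
    records, pending = [], ""
    for line in lines:
        pending = f"{pending} {line}" if pending else line
        if pending.count('"') % 2 == 0:  # quotes balanced: record complete
            records.append(pending)
            pending = ""
    if pending:                          # unterminated quote at end of input
        records.append(pending)
    return records

print(assemble_records(['a,b', 'x,"[[ 1', '2 ]]"']))
```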
For a language with a CSV library, I've found perl's Text::CSV useful for quoted newlines:
perl -e '
use Text::CSV;
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(utf8)", "file.csv" or die "test.csv: $!";
while (my $row = $csv->getline ($fh)) {
    $row->[-1] =~ s/\n//g;
    $csv->say(STDOUT, $row);
}
'
This might work for you (GNU sed):
sed -E '1b
:a;N;/"$/!ba
s/"/\n&/
h
s/\n/ /2g
s/.*\n//
s/ +/,/g
s/,\]/]/g
s/\[,/[/g
H
g
s/\n.*\n//' file
Forget the header line.
Gather up each record.
Introduce a newline before the last field.
Make a copy of the ameliorated record.
Replace all newlines from the second one onward with spaces.
Remove up to the first introduced newline.
Replace spaces by commas.
Remove any introduced commas after or before square brackets.
Append the last field to the copy.
Make the copy current.
Remove everything between (and including) the introduced newlines.
N.B. This expects only the last field of each record to be double quoted.
Alternative:
sed -E '1b;:a;N;/"$/!ba;y/\n/ /;:b;s/("\S+) +/\1,/;tb;s/,\[/[/g;s/\],/]/g' file
You can use GoCSV's replace command to easily strip out newlines:
gocsv replace \
-c broken_column_var \
-regex '\s+' \
-repl ' ' \
input.csv
That normalizes all contiguous whitespace (\s+) to a single space.
A very small Python script can also handle this:
import csv
import re
ws_re = re.compile(r"\s+")
f_in = open("input.csv", newline="")
reader = csv.reader(f_in)
f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)
writer.writerow(next(reader)) # transfer header
for row in reader:
    row[5] = ws_re.sub(" ", row[5])
    writer.writerow(row)
No third-party libraries needed, just the standard csv module.
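As a quick sanity check (my own illustration, not part of the script above): the csv reader treats a quoted embedded newline as part of the field, so each logical record comes back as a single row.

```python
import csv
import io

# Two physical lines after the header, but one logical record:
text = 'id,val\n1,"[[ 0.1\n0.2 ]]"\n'
rows = list(csv.reader(io.StringIO(text)))
assert rows == [["id", "val"], ["1", "[[ 0.1\n0.2 ]]"]]
```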
Pure bash, from How to parse a CSV file in Bash?, with a little modification checking for double-quote parity.
#!/bin/bash
enable -f /usr/lib/bash/csv csv
exec {FD}< "$1"
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\\n' "${headline[@]}"
numcols=${#headline[@]}
while read -ru $FD line; do
    while chk=${line//[^\"]}
          csv -a row -- "$line"
          [[ -n ${chk//\"\"} ]] || (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break 2
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\\n" "${row[@]}"
done
exec {FD}>&-
With your broken_input.csv, this shows:
SubID : "000000000"
Date1 : "0000-00-00"
date2 : "0000-00-00"
var1 : "0"
var2 : "FIRST\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[ -0.00000000 0.00000000 -0.00000000 -0.00000000 0.00000000\n-0.00000000 -0.00000000 0.00000000 0.00000000 0.00000000\n0.00000000 0.00000000 0.00000000]\n[ -0.00000000 -0.0000000 -0.00000000 -0.00000000 -0.0000000\n-0.0000000 -0.0000000 0.00000000 0.00000000 -0.00000000\n-0.00000000 0.00000000 0.0000000 ]]'"
SubID : "000000000"
Date1 : "1111-11-11"
date2 : "1111-11-11"
var1 : "1"
var2 : "SECOND\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[ 1.11111111 -1.11111111 -1.1111111 -1.1111111 1.1111111\n1.11111111 1.11111111 1.11111111]]'"
SubID : "000000000"
Date1 : "2222-22-22"
date2 : "2222-22-22"
var1 : "2"
var2 : "THIRD\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[-2.2222222 2.22222222 -2.22222222 -2.22222222 2.2222222 -2.22222222\n-2.22222222 -2.22222222 -2.22222222 2.22222222 2.22222222 2.22222222]\n[-2.22222222 -2.22222222 2.22222222 2.2222222 2.22222222 -2.22222222\n2.2222222 -2.2222222 2.22222222 2.2222222 2.222222 -2.22222222]\n[-2.22222222 -2.2222222 2.22222222 2.2222222 2.22222222 -2.22222222\n-2.22222222 -2.2222222 -2.22222222 2.22222222 2.2222222 2.22222222]\n[-2.22222222 -2.22222222 2.2222222 2.2222222 2.2222222 -2.22222222\n-2.222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 2.2222222 ]\n[-2.22222222 -2.222222 2.22222222 2.22222222 2.22222222 -2.2222222\n-2.2222222 -2.2222222 -2.2222222 -2.22222222 2.22222222 -2.222222 ]\n[ 2.22222222 -2.22222222 -2.222222 -2.222222 -2.2222222 -2.22222222\n-2.222222 -2.22222222 2.2222222 -2.2222222 2.2222222 2.22222222]]'"
SubID : "111111111"
Date1 : "0000-00-00"
date2 : "0000-00-00"
var1 : "00"
var2 : "FIRST\ TEXT\ FOR\ ONE"
broken_column_var: "$'[[ -0.00000000 0.00000000 -0.00000000 0.000000 -0.00000000\n-0.00000000 0.00000000 0.00000000]]'"
Your CSV seems not so broken after all!! Using bash to process a huge file (1.5 GB) is not recommended, though!! You may obtain better results using python, or c with appropriate libraries!