Remove newline in the middle of a CSV file
I need to clean a CSV file that looks like this:
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking
price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
Yes, some fields have no double quotes, but the newline only occurs inside double-quoted fields. The issue happens only with the 4th field.
I am working on an awk command, and this is what I have so far:
awk '{ if (substr($4,1,1) == "\"" && substr($4,length($4)) != "\"") gsub(/\n/," ");}' FS=";" input_file
This awk checks whether the first character of the 4th field is a double quote and the last one is not. It then tries to remove the newline, but it clearly does not remove it.
I think I am missing something small and "easy", but I can't figure out what it is.
Thanks for your help.
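As a side note on why the attempt above has no effect: with the default record separator, awk strips the newline before it fills $0, so gsub(/\n/," ") has nothing to match (and the script never prints anything either). A minimal sketch to confirm this, where the function name count_newline_hits is only an illustrative assumption:

```shell
# Report how many substitutions gsub(/\n/, " ") makes on each record.
count_newline_hits() {
    awk -F';' '{ n = gsub(/\n/, " "); print NR ": " n " substitution(s)" }' "$@"
}

# The broken row from the question arrives as two separate records.
printf '%s\n' 'foobar;123;NA;"asking' 'price";NR' | count_newline_hits
```

Both records report 0 substitutions, so the joining has to happen at output time or by reading the next record explicitly.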
You may use this awk:
awk -F';' -v ORS= '1; {print (NF==4 ? " " : "\n")}' file
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
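How it works:
This command initially sets ORS to an empty string, so the bare 1 prints each record without a trailing newline.
It then prints a space when NF == 4 (a record that was cut off inside the quoted 4th field); otherwise it prints a line break.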
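The same idea can be spelled out with a single printf instead of relying on ORS. This is only a sketch under the question's assumptions (five columns, with the break always inside the 4th field); the function name join_short_rows is hypothetical:

```shell
# Print each record followed by a space when it stops at the 4th field,
# and by a newline when it is complete.
join_short_rows() {
    awk -F';' '{
        # NF == 4 means the record ended inside the quoted 4th field.
        printf "%s%s", $0, (NF == 4 ? " " : "\n")
    }' "$@"
}

printf '%s\n' \
    'food;1;ZZ;"lipsum";NR' \
    'foobar;123;NA;"asking' \
    'price";NR' \
    'foobar;5;NN;Random text;NN' | join_short_rows
```

Using "%s" as the format keeps the output correct even if the data contains a % character.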
Using GNU sed
$ sed -Ez 's/(;"[^"]*)\n/\1/g' input_file
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
One idea for tweaking OP's current awk code:
awk -F';' '
{ if (substr($4,1,1) == "\"" && substr($4,length($4)) != "\"") {   # if we have an incomplete line then ...
      printf "%s", $0   # printf, sans a "\n", leaves the cursor at the end of the current line; "%s" keeps any % in the data literal
      next              # skip to next line of input
  }
}
1                       # otherwise print current line
' input_file
# or as a one-liner sans comments:
awk -F';' ' { if (substr($4,1,1) == "\"" && substr($4,length($4)) != "\"") { printf "%s", $0; next } } 1 ' input_file
This generates:
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
With GNU awk for RT:
$ awk -v RS='"' '!(NR%2){gsub(/\n/,"")} {ORS=RT} 1' file
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
This might work for you (GNU sed):
sed -E ':a;/^[^\"]*(\\.[^\"]*)*("[^\"]*(\\.[^"\]*)*"[^\"]*)*"[^\"]*(\\.[^"\]*)*$/{N;s/\n//;ta}' file
This matches any line with unbalanced double quotes (with or without escaped double quotes), appends the following line, removes the newline and repeats until the double quotes are balanced.
A simpler solution that does not handle escaped double quotes:
sed -E ':a;/^[^"]*("[^"]*"[^"]*)*"[^"]*$/{N;s/\n//;ta}' file
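The same quote-balancing loop can also be sketched in plain awk, which may be easier to read than the regex. This is an illustration rather than a drop-in equivalent: it ignores escaped quotes, the function name join_unbalanced is hypothetical, and the sep variable is an assumption (a single space here; set it to an empty string to mimic s/\n// exactly):

```shell
# While a line holds an odd number of double quotes, pull in the next
# line and glue it on, then print the completed line.
join_unbalanced() {
    awk -v sep=" " '{
        line = $0
        # gsub with "&" leaves the text unchanged and returns the quote count.
        while ((gsub(/"/, "&", line) % 2) == 1 && (getline nxt) > 0)
            line = line sep nxt
        print line
    }' "$@"
}

printf '%s\n' \
    'food;1;ZZ;"lipsum";NR' \
    'foobar;123;NA;"asking' \
    'price";NR' \
    'foobar;5;NN;Random text;NN' | join_unbalanced
```

Unlike the NF-based approaches, this does not depend on which column contains the break, and it handles a quoted field that spans more than two lines.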
echo 'food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking
price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO' |
mawk '(ORS = (_<ORS)==(NF % 2)? RS: _)^_' FS=';' | gcat -n
1 food;1;ZZ;"lipsum";NR
2 foobar;123;NA;"asking price";NR
3 foobar;5;NN;Random text;NN
4 moongoo;13;VV;"Any label";OO