How to append lines that match two patterns to the previous line in a file?
I have a CSV file in which what is supposed to be a single line is split across several lines. I need help finding a way to join the lines that are split. Also, the number of fields (separated by ,) is not fixed.
A correct line has the following pattern:
X,X,X,"()",Y,H
where X can be any number of fields. However, the end of the string (shown in bold in the original post) is fixed. Y and H are both one word.
The issue is that this line can appear as (or any variant of this):
X,X,
X, "()"
,Y,H
What I need is a way (awk, sed) to append the lines that do not have 24 or more commas and do not end with ",Y,H to the previous line.
Please bear in mind that it is a large file, although I have 256 GB of RAM.
Example
a, b, c, "()", h, k
a, b, c, d, "()", h, k
First line
a, b, c,
"()", h, k
Second line
a, b, c, d, "()"
, h
, k
So far I've tried this (not working):
awk '/"[:space:]*,[:space:]*[:alpha:]+[:space:]*,[:space:]*[:alpha:]+$/{print}' check.csv
to try to find the lines ending with ", X, Y where X and Y are words.
Also, as the minimum number of fields in a correct line is 24, I've used:
awk 'NF<24{print}' check.csv
to pick out lines with fewer than 24 fields.
My idea is to detect lines that match both conditions and append them to the previous line.
Thank you!
This might work for you (GNU sed):
sed '/"()", *[^,]\+, *[^,]\+$/b;:a;N;s/\n//;/"()", *[^,]\+, *[^,]\+$/!ba;P;D' file
If a line is already correct, do not process it; just branch to the end of the script so it is printed as is.
Otherwise append the next line, remove the introduced newline, and try to match again.
Repeat until the joined text matches, then print it and delete it (P;D), and start over with the next input line.
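For comparison, the same accumulate-until-match loop can be sketched in awk, which the question also mentions. This is only a minimal sketch under assumptions: the function name join_split_lines is made up for illustration, the pattern allows optional spaces around the commas (slightly looser than the sed pattern above), and a record is taken to be complete once the buffer ends with "()" followed by exactly two more fields.

```shell
# Hypothetical helper: glue physical lines together until the buffer
# ends with  "()" , <field> , <field>  and then print it as one record.
join_split_lines() {
  awk '
    # Append the current line to the buffer (no separator is inserted).
    { buf = buf $0 }

    # When the buffer ends with the fixed tail, emit it and start over.
    buf ~ /"\(\)" *, *[^,]+, *[^,]+$/ { print buf; buf = "" }

    # Flush any incomplete remainder at end of input.
    END { if (buf != "") print buf }
  ' "$@"
}
```

It could be run as join_split_lines check.csv > fixed.csv, or fed from a pipe. Only one record is held in memory at a time, so the size of the file should not matter much.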
perl -lanF, -e 'push @L, grep length, @F; if ($L[-3] eq q/"()"/) { print join ",", @L; @L=() }' file
-l -n -e
to loop over input lines without printing, and append line breaks to output
-a -F,
to create the @F array by splitting input on commas
push @L, grep length, @F
push nonempty fields onto @L
if ($L[-3] eq q/"()"/)
if the third-to-last accumulated field is the magic marker:
print join ",", @L
print all of @L joined with commas
@L=()
reset @L
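The field-accumulation idea can also be sketched in awk. This is a hedged sketch rather than a drop-in replacement: the function name join_by_fields is hypothetical, and trimming the blanks around each field is an added assumption so that spaced input such as the question's example (a, b, c, "()", h, k) still hits the marker. Like the Perl version, it drops empty fields and does not understand commas inside quoted fields.

```shell
# Hypothetical helper: collect nonempty, trimmed fields across lines and
# print them as one record once the third-to-last field is the marker "()".
join_by_fields() {
  awk -F',' '
    {
      for (i = 1; i <= NF; i++) {
        f = $i
        gsub(/^[ \t]+|[ \t]+$/, "", f)   # trim blanks around the field
        if (f != "") acc[++n] = f        # keep nonempty fields only
      }
      # Marker in third-to-last position: the record is complete.
      if (n >= 3 && acc[n-2] == "\"()\"") {
        line = acc[1]
        for (i = 2; i <= n; i++) line = line "," acc[i]
        print line
        n = 0                            # reset the accumulator
      }
    }
  ' "$@"
}
```

Note that the output is normalized to comma-only separators, so the original spacing is not preserved.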