How to split a delimited text file in Linux by number of records when the end-of-record separator also appears inside data fields
Problem Statement:
I have a delimited text file offloaded from Teradata which happens to have "\n" (newline characters, i.e. EOL markers) inside data fields.
The same EOL marker also terminates each complete record.
I need to split this file into two or more files (based on a number of records that I specify), keeping the newline characters inside data fields intact and splitting only at the line breaks that end a record.
Example:
1|Alan
Wake|15
2|Nathan
Drake|10
3|Gordon
Freeman|11
Expectation:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
What I have tried:
awk 'BEGIN{RS="\n"}NR%2==1{x="SplitF"++i;}{print > x}' inputfile.txt
The code can't distinguish newlines inside data fields from the newlines that end a record. Is there a way this can be achieved?
EDIT: I have changed the problem statement and its example. Please share your thoughts on the new example.
Use the following awk approach:
awk '{ r = (r != "") ? r RS $0 : $0    # collect physical lines into r
       if (NR % 4 == 0) { f = "file" (++i) ".txt"; print r > f; close(f); r = "" } }
     END { if (r != "") print r > ("file" (++i) ".txt") }' inputfile.txt
NR%4==0: each logical record occupies two physical lines, so with two records per file we start a new file after every 4 physical lines.
Results:
> cat file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
> cat file2.txt
3|Gordon
Freeman|11
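The hard-coded 4 in the command above is just lines-per-record times records-per-file, so it can be derived from two parameters. The following is a minimal sketch, assuming every logical record spans exactly `per` physical lines (2 in the example). The variable names `per` and `n` and the file name `sample.txt` are illustrative and not part of the original answer.

```shell
# Recreate the sample input from the question (illustrative file name).
printf '1|Alan\nWake|15\n2|Nathan\nDrake|10\n3|Gordon\nFreeman|11\n' > sample.txt

# per = physical lines per logical record, n = logical records per output file.
awk -v per=2 -v n=2 '
    # Switch to a new output file at the first line of every group of per*n lines.
    (NR - 1) % (per * n) == 0 {
        if (out != "") close(out)      # release the previous file handle
        out = "file" (++i) ".txt"
    }
    # Every physical line goes, unchanged, to the current output file.
    { print > out }
' sample.txt
```

Because each line is written as soon as it is read, nothing has to be buffered, and closing each file before opening the next keeps the number of open files at one even when the input produces many parts.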
If you are using GNU awk you can do this by setting RS appropriately, e.g.:
parse.awk
BEGIN { RS="[0-9]\\|" }
# Skip the empty first record by checking NF (Note: this will also skip
# any empty records later in the input)
NF {
    # Send record with the appropriate key to a numbered file
    printf("%s", d $0) > ("file" i ".txt")
}
# When we have found enough records, close the current file and
# prepare i for opening the next one
#
# Note: NR-1 because of the empty first record
(NR-1)%n == 0 {
    close("file" i ".txt")
    i++
}
# Remember the record key in d, again
# because of the empty first record
{ d=RT }
Run it like this:
gawk -f parse.awk n=2 infile
Where n is the number of records to put into each file.
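To see why parse.awk has to carry `RT` over in `d`, it can help to print what gawk sees for each record. This is a small diagnostic sketch, not part of the original answer. It requires GNU awk, since `RT` is a gawk extension, and the output file name `rt_demo.txt` is illustrative.

```shell
# RT is only set by GNU awk, so skip the demo when gawk is not installed.
if command -v gawk >/dev/null 2>&1; then
    printf '1|Alan\nWake|15\n2|Nathan\nDrake|10\n' |
    gawk 'BEGIN { RS = "[0-9]\\|" }
          # For each record, show its number, field count and the text
          # that matched RS at the end of that record.
          { printf "NR=%d NF=%d RT=[%s]\n", NR, NF, RT }' > rt_demo.txt
    cat rt_demo.txt
fi
```

The first record is empty because the input starts with a separator match, and each record's `RT` holds the key that belongs to the following record. That is why the script prints the saved `d` in front of `$0`. Note that this `RS` matches any single digit followed by `|`, so multi-digit keys or digits directly before a `|` inside the data would need a different pattern.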
Output of parse.awk for the example input:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
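Both answers above lean on properties of the sample: two physical lines per record, or a record key made of one digit followed by `|`. If neither holds but the number of fields per record is known, one alternative is to count delimiters instead. The following is a minimal sketch in POSIX awk. It assumes every record has exactly `nf` fields, that `|` never occurs inside a data field, and that the last field contains no newline. The names `nf`, `n`, `sample.txt` and the `part` prefix are illustrative, and record 4 is an extra single-line record added only to show that line counts may vary.

```shell
# Sample from the question plus one extra single-line record (illustrative).
printf '1|Alan\nWake|15\n2|Nathan\nDrake|10\n3|Gordon\nFreeman|11\n4|Jill|12\n' > sample.txt

# nf = fields per logical record, n = logical records per output file.
awk -v nf=3 -v n=2 '
    {
        # Append this physical line to the record being assembled.
        rec = (have ? rec "\n" : "") $0
        have = 1
        # gsub on a copy returns how many "|" this line contained.
        tmp = $0
        seps += gsub(/[|]/, "", tmp)
    }
    # nf fields means nf-1 delimiters: the logical record is complete.
    seps >= nf - 1 {
        if (count % n == 0) {
            if (out != "") close(out)
            out = "part" (++i) ".txt"
        }
        print rec > out
        count++
        rec = ""; have = 0; seps = 0
    }
' sample.txt
```

With `n=2`, this should put records 1 and 2 in part1.txt and records 3 and 4 in part2.txt, even though record 4 occupies a single line. A record still incomplete at end of input is silently dropped in this sketch; an `END` rule could report it.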