
Extracting lines from a fixed-format file without spaces based on a column and a list of lookup IDs

I have a fairly large fixed-format file without spaces (file1):

file1:

0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440

I want to extract character positions 4-8, 11-15 and 20-24 from each line, but only for lines whose positions 4-8 (and only those positions) match an ID listed in file2:

file2:

55555
42512

The desired output is:

55555 36933 07802
42512 08800 78659

I have tried the following combination of cut | grep commands:

cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -w -F -f file2

It runs and is very fast, but I also get lines where the lookup ID is not in the first column of the cut output (positions 4-8). That happens because grep checks all three columns produced by cut, not only the first one.

Here is the output of the command above:

85638 55555 36712
55555 36933 07802
66666 00000 89336
42512 08800 78659
04690 00990 42512

I know I could write the output to a file and then post-process it with, for example, awk, but I thought there might be a simpler approach that avoids the extra processing time (for example, making grep match only in a specific column of the cut output).

Any help would be much appreciated. Many thanks!

Would you please try the following:

cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -wf <(sed 's/^/^/' file2)

Each line of file2 is prefixed with a caret (^), which anchors the pattern to the start of each line that cut outputs.
It may be a bit slower than before, because the patterns are now regular expressions and the -F option can no longer be used.
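
If your shell lacks process substitution, one possible variant is to write the anchored patterns to a file first. The sketch below recreates the sample data from the question in a scratch directory; the directory, the patterns file and the variable names are only illustrative and not part of the command above.

```shell
# Recreate the sample data from the question in a scratch directory.
# The directory and variable names are illustrative assumptions.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
EOF
printf '%s\n' 55555 42512 > "$tmp/file2"

# Anchor every ID to the start of the line and store the result as a pattern file.
sed 's/^/^/' "$tmp/file2" > "$tmp/patterns"
patterns=$(cat "$tmp/patterns")
printf '%s\n' "$patterns"

# Same cut | grep pipeline, reading the anchored patterns from the file.
# --output-delimiter is a GNU cut option.
result=$(cut -c 4-8,11-15,20-24 --output-delimiter=' ' "$tmp/file1" |
         grep -wf "$tmp/patterns")
printf '%s\n' "$result"

rm -rf "$tmp"
```

The first printf shows what grep receives (^55555 and ^42512), which is why a matching ID in the second or third column no longer selects the line.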

With GNU awk, using FIELDWIDTHS (the widths 3 5 2 5 4 5 make $2, $4 and $6 correspond to positions 4-8, 11-15 and 20-24):

$ awk -v FIELDWIDTHS='3 5 2 5 4 5 *' 'NR==FNR{a[$0]; next} $2 in a{ print $2, $4, $6 }' file2 file1
55555 36933 07802
42512 08800 78659
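
If GNU awk is not available, the same lookup can probably be done in any POSIX awk with substr() instead of FIELDWIDTHS. This is a minimal sketch, not part of the answer above; the scratch directory and the names ids, id and out are assumptions made for illustration. It also assumes file2 is not empty, since NR==FNR would otherwise stay true while file1 is read.

```shell
# Recreate the sample data from the question in a scratch directory (illustrative only).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
EOF
printf '%s\n' 55555 42512 > "$tmp/file2"

out=$(awk '
    # First file (file2): remember every ID as an array key.
    NR == FNR { ids[$0]; next }
    # Second file (file1): test positions 4-8 against the stored IDs.
    {
        id = substr($0, 4, 5)
        if (id in ids)
            print id, substr($0, 11, 5), substr($0, 20, 5)
    }
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$out"

rm -rf "$tmp"
```

Because only the substring at positions 4-8 is tested, an ID that happens to occur in one of the other columns cannot select a line.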
