How to test whether a substring from each line of File1 exists in File2

Question

I have two files with the following data:

file1:

6100540301SD01        ON5330399520191104906781            2019110390678151053303995ACK          20191105
6100540301SD01        ON0403096420191104225695            2019110322569551004030964A            20191105
6005260301SD01        46460045792019110490678911059455    2019110490678951000755694BE3        1120191105
6005260301SD01        46460045792019110490679616020577    2019110490679651000764053BDJDEDH    1620191105

file2:

20191104
20191105
20191106

Since file1 is a fixed-width file, the string at character positions 97 to 104 is a date. I want to extract that string by position and check whether it exists in file2. If it does, I want to copy the whole line to file3; if it does not, I want to copy the line to file4.

I have written a C++ program, but it takes a long time to process file1, which has almost half a million records. If an awk or sed script could help, please share.

Answer 1

Turn the contents of file2 into a regular expression like 20191104|20191105|20191106. Then you can use grep to match it.

patterns=$(<file2)
# Replace newlines with |
pattern=${patterns//$'\n'/|}
# Put ^.{96} at the beginning so it matches starting at column 97
pattern="^.{96}($pattern)"
grep -E "$pattern" file1 > file3 # Lines that match
grep -v -E "$pattern" file1 > file4 # Lines that don't match
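To see what the pattern looks like and how the two grep calls split the input, here is a minimal sketch on made-up data. The record names (rec-a, rec-b, rec-c) and the scratch directory are assumptions for illustration only; each sample line is padded to 96 characters so the date starts at column 97, as in the question. It uses paste to join the lines of file2 instead of the bash expansion, so that it also runs under plain sh.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not the asker's real files).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%-96s%s\n' rec-a 20191105 rec-b 20191107 rec-c 20191104 > file1
printf '%s\n' 20191104 20191105 20191106 > file2

# Join the dates with | and anchor the alternation at column 97.
pattern="^.{96}($(paste -s -d '|' file2))"
echo "$pattern"

grep -E "$pattern" file1 > file3       # lines whose date is listed in file2
grep -v -E "$pattern" file1 > file4    # every other line
```

With this sample, rec-a and rec-c should land in file3 and rec-b in file4.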

If running grep twice is too slow, you could use awk:

awk -v pat="$pattern" '$0 ~ pat { print > "file3"; next } { print > "file4" }' file1

Answer 2

awk to the rescue!

$ awk 'NR==FNR {dates[$0]; next}
               {print > ((substr($0,97,8) in dates) ? "file3" : "file4")}' file2 file1
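The idea here is a lookup table: file2 is read first and each date becomes an array key, then every line of file1 costs one substr call and one key lookup. The sketch below runs that on made-up data; the record names and scratch directory are assumptions, and the third argument of substr is a length (8 characters), not an end position.

```shell
#!/bin/sh
# Hypothetical sample data; each line is padded so the date sits in columns 97-104.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%-96s%s\n' rec-a 20191105 rec-b 20191107 rec-c 20191104 > file1
printf '%s\n' 20191104 20191105 20191106 > file2

awk '
    NR == FNR { dates[$0]; next }      # first file (file2): store each date as an array key
    {
        key = substr($0, 97, 8)        # start at column 97, take 8 characters
        print > ((key in dates) ? "file3" : "file4")
    }
' file2 file1
```

With this sample, file3 should receive rec-a and rec-c, and file4 should receive rec-b. Note that the NR==FNR test assumes file2 is not empty; with an empty file2 the lines of file1 would be treated as dates.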

Answer 3

This might work for you (GNU sed):

sed 's#.*#/^.\\{96\\}&/ba#' file2 | sed -nf - -e 'w file4' -e 'b;:a;w file3' file1

Create a script from file2 which writes each match to file3 and any remaining lines to file4.

The first invocation of sed passes its output to the second invocation, which reads it as a script (-f -) and appends two inline command strings to it. Lines that match one of the dates branch to the label :a, which writes them to file3; lines that match nothing fall through and are written to file4.
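To make the generated script visible, the sketch below runs only the first sed stage on the three dates from the question and saves the result. The file name generated.sed and the scratch directory are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical scratch directory holding a copy of the question's file2.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 20191104 20191105 20191106 > file2

# Stage one only: wrap each date in an address that branches to label a.
sed 's#.*#/^.\\{96\\}&/ba#' file2 > generated.sed
cat generated.sed
# prints:
# /^.\{96\}20191104/ba
# /^.\{96\}20191105/ba
# /^.\{96\}20191106/ba
```

The second sed then appends w file4, b, the label :a and w file3 after these lines, so a line that reaches w file4 has matched none of the addresses.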

3 answers

Answer 1: 0, 2019-11-07 21:38:36
Answer 2: 0, accepted, 2019-11-07 21:44:04
Answer 3: 0, 2019-11-08 15:20:12
