Querying a large tab-delimited file using bash

Question

I have a list of names and IDs (50 entries):

cat input.txt

name    ID
Mike    2000
Mike    20003
Mike    20002

And there is a huge zipped file (13GB):

zcat clients.gz

name    ID  comment
Mike    2000    foo
Mike    20002   bar
Josh    2000    cake
Josh    20002   _

My expected output is:

NR  name    ID  comment
1    Mike   2000    foo
3    Mike   20002   bar

Each $1"\t"$2 in clients.gz is a unique identifier. Some entries from input.txt might be missing from clients.gz, so I would like to add the NR column to my output to find out which ones are missing. I would like to use zgrep. awk takes a very long time (I assume because I have to zcat the file to uncompress it?).

I know that zgrep 'Mike\t2000' does not work. I imagine I can fix the NR issue with awk's FNR.

So far I have:

awk -v q="'" '
NR > 1 {
print "zcat clients.gz | zgrep -w $" q$0q
}' input.txt |
bash > subset.txt

Answer 1

With GNU awk and bash:

awk 'BEGIN{FS=OFS="\t"} 
     # process input.txt
     NR==FNR{
       a[$1,$2]=$1 FS $2
       line[$1,$2]=NR-1
       next
     }
     # process <(zcat clients.gz)
     {
       $4=a[$1,$2]
       if(FNR==1)
         line[$1,$2]="NR"
       if($4!="")
         print line[$1,$2],$1,$2,$3
     }' input.txt <(zcat clients.gz)

Output:

NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar

As one line:

awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1,$2]=$1 FS $2; line[$1,$2]=NR-1; next} {$4=a[$1,$2]; if(FNR==1) line[$1,$2]="NR"; if($4!="")print line[$1,$2],$1,$2,$3}' input.txt <(zcat clients.gz)

See: Joining two files based on two key columns awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
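
The a[$1,$2] subscripts in the script above rely on awk's simulated multi-dimensional arrays: the comma joins the parts with the built-in SUBSEP character, so the pair is stored under a single string key. A minimal sketch of that behaviour on one sample row; the names `seen`, `part`, and `out` are illustrative and not part of the answer.

```shell
# Feed a single tab-separated row to awk and inspect how the key is stored.
out=$(printf 'Mike\t2000\n' | awk -F '\t' '
  {
    seen[$1, $2] = 1                   # stored under the key $1 SUBSEP $2
    for (k in seen) {
      split(k, part, SUBSEP)           # recover the two components of the key
      print part[1] "|" part[2]
    }
    if (($1, $2) in seen)              # membership test with the same syntax
      print "found"
  }')
printf '%s\n' "$out"
```

Because SUBSEP defaults to a control character that rarely occurs in data, keys such as Mike/2000 and Mike/20002 cannot collide the way plain concatenation without a separator could.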

Answer 2

[EDIT]
I misunderstood where the prepended line numbers come from. Corrected.

Would you try the following:

declare -A num          # associates each pattern with its line number
mapfile -t ary < <(tail -n +2 input.txt)
pat=$(IFS='|'; echo "${ary[*]}")
for ((i=0; i<${#ary[@]}; i++)); do num[${ary[i]}]=$((i+1)); done
printf "%s\t%s\t%s\t%s\n" "NR" "name" "ID" "comment"
zgrep -E -w "$pat" clients.gz | while IFS= read -r line; do
    printf "%d\t%s\n" "${num[$(cut -f 1-2 <<<"$line")]}" "$line"
done

Output:

NR  name    ID  comment
1   Mike    2000    foo
3   Mike    20002   bar

The second and third lines generate a search pattern such as Mike 2000|Mike 20003|Mike 20002 from input.txt.
The line for ((i=0; i<${#ary[@]}; i++)); do .. creates a map from each pattern to its line number.
The expression "${num[$(cut -f 1-2 <<<"$line")]}" retrieves the line number using the 1st and 2nd fields of each matched line.
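
To make these three steps concrete, here is a minimal sketch that builds the alternation pattern and the lookup table from an in-memory array instead of input.txt. It assumes bash 4 or later (for declare -A); the array contents mirror the question's sample, and the names `entries`, `pattern`, `rank`, and `key` are illustrative.

```shell
# Requires bash >= 4 for associative arrays.
declare -A rank
entries=($'Mike\t2000' $'Mike\t20003' $'Mike\t20002')

# "${entries[*]}" joins the elements with the first character of IFS;
# the command substitution keeps the IFS change out of the main shell.
pattern=$(IFS='|'; echo "${entries[*]}")

# Map each "name<TAB>ID" key to its 1-based position in the list.
for ((i = 0; i < ${#entries[@]}; i++)); do
    rank[${entries[i]}]=$((i + 1))
done

# Look a key up the way the answer does: cut the first two fields of a line.
line=$'Mike\t20002\tbar'
key=$(cut -f 1-2 <<<"$line")
printf '%s\n' "${rank[$key]}"
```

The separators inside the pattern are literal tab characters, which is why the pattern matches the tab-delimited rows even though grep itself does not interpret \t.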

If the performance is still not satisfactory, please consider ripgrep, which is much faster than grep or zgrep.
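
Before switching tools, it may be worth checking whether fixed-string matching is fast enough, since it avoids the regex alternation entirely. The sketch below is a variant under assumptions, not part of the original answer: it feeds the input.txt rows to grep -F -w -f as literal patterns, recreates the question's sample data in a scratch directory, and uses gzip -dc for decompression. The file name patterns.txt is hypothetical. Because a pattern may match anywhere on a line, a comment field that happens to contain a name and ID pair would produce a false hit.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'name\tID\nMike\t2000\nMike\t20003\nMike\t20002\n' > "$tmp/input.txt"
printf 'name\tID\tcomment\nMike\t2000\tfoo\nMike\t20002\tbar\nJosh\t2000\tcake\nJosh\t20002\t_\n' |
  gzip > "$tmp/clients.gz"

# Drop the header row so that only real name<TAB>ID pairs become patterns.
tail -n +2 "$tmp/input.txt" > "$tmp/patterns.txt"

# -F: literal strings, -w: whole words only (so "2000" does not match "20002"),
# -f: read one pattern per line from the file.
hits=$(gzip -dc "$tmp/clients.gz" | grep -F -w -f "$tmp/patterns.txt")

printf '%s\n' "$hits"
rm -rf "$tmp"
```

With ripgrep installed, rg -z -F -w -f patterns.txt clients.gz should be the equivalent call, where -z asks it to search compressed files; this has not been timed against the 13GB file.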

Answer 3

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ key = $1 FS $2 }
NR == FNR { map[key] = (NR>1 ? NR-1 : "NR"); next }
key in map { print map[key], $0 }

$ zcat clients.gz | awk -f tst.awk input.txt -
NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar
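
The NR column is meant to reveal which input.txt entries never appear in clients.gz, and the same script can be extended to report those directly. A minimal sketch under assumptions: the `seen` array, the END block, and the `missing` variable are additions for illustration, the sample data is recreated from the question in a scratch directory, and gzip -dc stands in for zcat.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'name\tID\nMike\t2000\nMike\t20003\nMike\t20002\n' > "$tmp/input.txt"
printf 'name\tID\tcomment\nMike\t2000\tfoo\nMike\t20002\tbar\nJosh\t2000\tcake\nJosh\t20002\t_\n' |
  gzip > "$tmp/clients.gz"

# Record each input.txt key with its line number, mark the keys that show up
# in clients.gz, and print the unmarked ones once all input has been read.
missing=$(gzip -dc "$tmp/clients.gz" | awk '
  BEGIN { FS = OFS = "\t" }
  { key = $1 FS $2 }
  NR == FNR {
    if (FNR > 1) map[key] = FNR - 1    # skip the header row of input.txt
    next
  }
  key in map { seen[key] = 1 }
  END {
    for (key in map)
      if (!(key in seen))
        print map[key], key
  }' "$tmp/input.txt" -)

printf '%s\n' "$missing"
rm -rf "$tmp"
```

The order of a for (key in map) loop is unspecified in awk, so with several missing entries the output may need a pass through sort -n.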

3 answers

Answer 1
1 2020-01-26 09:05:18

Answer 2
1 2020-01-26 09:11:42

Answer 3
1 accepted 2020-01-26 16:58:57
