Extracting rows from file based on another file using awk
I have two files.
File 1:
SNP Allele1 Allele2 Effect StdErr PVAL Direction HetISq HetChiSq HetDf HetPVal
rs12266638 t g 0.4259 0.0838 3.776e-07 +? 0.0 0.000 0 1
rs7995014 t c 2.2910 0.5012 4.853e-06 +? 0.0 0.000 0 1
You may use this awk:
awk 'FNR==NR {a[$3]; next} FNR> 1 && $1 in a' file2 file1
rs12266638 t g 0.4259 0.0838 3.776e-07 +? 0.0 0.000 0 1
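To see how the one-liner works, here is a commented sketch run against small sample files. The file1 rows come from the question but are trimmed to the first six columns for brevity. The contents of file2 are hypothetical (the CHR and POS values and the id rs0000001 are made up); the only thing assumed from the answer is that the SNP identifier is in the third column of file2.

```shell
#!/bin/bash
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# file1: the two rows from the question, trimmed to six columns.
cat > file1 <<'EOF'
SNP Allele1 Allele2 Effect StdErr PVAL
rs12266638 t g 0.4259 0.0838 3.776e-07
rs7995014 t c 2.2910 0.5012 4.853e-06
EOF

# file2: HYPOTHETICAL contents; only "SNP id in column 3" is assumed.
cat > file2 <<'EOF'
CHR POS SNP
10 1000 rs12266638
13 2000 rs0000001
EOF

result=$(awk '
    # First file on the command line (file2): FNR equals NR only here.
    # Remember each third-column value as an array key, then skip to the next line.
    FNR == NR { seen[$3]; next }
    # Second file (file1): skip the header line and print rows whose first field was seen.
    FNR > 1 && $1 in seen
' file2 file1)

printf '%s\n' "$result"
```

Only the rs12266638 row is printed, because rs7995014 does not appear in the third column of the sample file2.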
Depending on the size of the dataset, this should be fairly fast, since each file is read only once. I am not on a system where I can benchmark it at the moment, so that is mostly a hunch. A solution like this is probably only suitable if the number of unique identifiers isn't very large, though.
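#!/bin/bash
# Build an alternation of the unique SNP ids from column 3 of file_2 (header skipped).
snp_expression=$(awk 'FNR>1{print $3}' file_2 | sort -u | paste -sd "|" -)
# Keep the rows of file_1 whose first field is one of those ids.
grep -E "^(${snp_expression})[[:space:]]" file_1 > file_3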
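One reason the identifier count matters is that the whole alternation is passed to grep as a single command-line argument, which can run into the system's argument-length limit. A possible workaround, sketched below, is to write one anchored pattern per line to a file and hand it to grep with -f. The file name patterns.txt and the contents of file_2 are hypothetical, and file_1 is trimmed to six columns; with very many patterns the awk lookup shown earlier may still be the better fit.

```shell
#!/bin/bash
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# file_1: the two rows from the question, trimmed to six columns.
cat > file_1 <<'EOF'
SNP Allele1 Allele2 Effect StdErr PVAL
rs12266638 t g 0.4259 0.0838 3.776e-07
rs7995014 t c 2.2910 0.5012 4.853e-06
EOF

# file_2: HYPOTHETICAL contents; only "SNP id in column 3" is assumed.
cat > file_2 <<'EOF'
CHR POS SNP
10 1000 rs12266638
13 2000 rs0000001
EOF

# One anchored pattern per unique identifier, e.g. ^rs12266638[[:space:]]
awk 'FNR > 1 { print "^" $3 "[[:space:]]" }' file_2 | sort -u > patterns.txt

# -f reads the patterns from a file; a line matching any of them is kept.
grep -E -f patterns.txt file_1 > file_3

result=$(cat file_3)
printf '%s\n' "$result"
```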
A more general solution which works for any position of the SNP field:
# SO71009277.awk
BEGIN {
    fnr = 0
    while ((getline < ARGV[1]) > 0) {
        ++fnr
        if (fnr == 1) {
            for (i=1; i<=NF; i++)
                FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["SNP"] = 1
        }
        else {
            SNP_KEY[$FIELDBYNAME1["SNP"]] = $0
        }
    }
    close(ARGV[1])
    fnr = 0
    while ((getline < ARGV[2]) > 0) {
        ++fnr
        if (fnr == 1) {
            for (i=1; i<=NF; i++)
                FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["SNP"] = 3
        }
        else {
            if ($FIELDBYNAME2["SNP"] in SNP_KEY)
                print SNP_KEY[$FIELDBYNAME2["SNP"]]
        }
    }
    close(ARGV[2])
}
Call:
awk -f SO71009277.awk file1.txt file2.txt
=>
rs12266638 t g 0.4259 0.0838 3.776e-07 +? 0.0 0.000 0 1
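The same header-driven lookup can also be sketched with awk's normal input loop instead of explicit getline calls. This is an alternative formulation, not the script above: the variable names col and row are illustrative, the contents of file2.txt are hypothetical (only the SNP column name is assumed), and file1.txt is trimmed to six columns.

```shell
#!/bin/bash
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# file1.txt: the two rows from the question, trimmed to six columns.
cat > file1.txt <<'EOF'
SNP Allele1 Allele2 Effect StdErr PVAL
rs12266638 t g 0.4259 0.0838 3.776e-07
rs7995014 t c 2.2910 0.5012 4.853e-06
EOF

# file2.txt: HYPOTHETICAL contents; the SNP column is found by its header name.
cat > file2.txt <<'EOF'
CHR POS SNP
10 1000 rs12266638
13 2000 rs0000001
EOF

result=$(awk '
    # Header line of either file: record which column is named "SNP".
    FNR == 1 { col = 0; for (i = 1; i <= NF; i++) if ($i == "SNP") col = i; next }
    # First file (file1.txt): store each row under its SNP value.
    FNR == NR { row[$col] = $0; next }
    # Second file (file2.txt): print the stored file1.txt row when the SNP is known.
    $col in row { print row[$col] }
' file1.txt file2.txt)

printf '%s\n' "$result"
```

As in the getline version, the matching rows are printed in the order in which their identifiers appear in the second file.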