Extract molecules in order from SDF file according to IDs given in another file
I have an SD file containing thousands of molecules, and I need to extract molecules from it according to their IDs, which are listed in a simple one-column file. An example of the SDF, file1.sdf:
MOL108108
-Chem-8567890432
15 15 0 0 0 0 0 0 0999 V2000
6.1792 -2.6875 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.9542 -2.6875 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.4125 -2.7167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.1667 -3.4667 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.1667 -1.9000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7375 -3.4625 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1000 -2.7667 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.1500 -4.1292 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.0542 -3.3792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0167 -2.0542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8792 -2.7542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.2542 -3.7125 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2500 -2.0792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2875 -3.4042 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9542 -3.4875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
6 7 1 0 0 0 0
7 11 1 0 0 0 0
6 8 2 0 0 0 0
3 9 1 0 0 0 0
3 10 2 0 0 0 0
11 13 2 0 0 0 0
2 12 1 0 0 0 0
10 13 1 0 0 0 0
9 14 2 0 0 0 0
6 15 1 0 0 0 0
11 14 1 0 0 0 0
M END
> <mol_id>
MOL108108
$$$$
MOL16520
-Chem4051902312
22 21 0 1 0 0 0 0 0999 V2000
0.2750 0.1500 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.2458 -0.1500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7917 -0.1500 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
1.3167 0.1458 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7625 0.1583 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
-1.8083 0.1583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7917 -0.7500 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
-1.2833 -0.1417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2458 -0.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.3167 0.7458 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.7625 0.7583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8000 0.7583 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.8292 -0.1542 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-2.3208 -0.1417 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-2.8375 0.1583 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.6083 1.3333 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
1.3125 -1.0542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7875 -1.3500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2750 -1.0500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.3542 0.1458 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0375 1.7583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0333 1.4875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
3 4 1 0 0 0 0
2 5 1 0 0 0 0
6 8 1 0 0 0 0
3 7 1 6 0 0 0
5 8 1 0 0 0 0
2 9 2 0 0 0 0
4 10 2 0 0 0 0
5 11 1 1 0 0 0
6 12 2 0 0 0 0
4 13 1 0 0 0 0
6 14 1 0 0 0 0
14 15 1 0 0 0 0
11 16 1 0 0 0 0
7 17 1 0 0 0 0
7 18 1 0 0 0 0
7 19 1 0 0 0 0
13 20 1 0 0 0 0
16 21 1 0 0 0 0
16 22 1 0 0 0 0
M END
> <mol_id>
MOL16520
$$$$
MOL55310
-Chem04051902312
11 11 0 0 0 0 0 0 0999 V2000
6.7292 -1.5750 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
7.5542 -1.5750 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.7250 -2.4000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.7292 -0.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.9125 -1.5917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.9667 -0.8542 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.5167 -2.3292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4792 -0.8917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.6542 -0.9167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.6917 -2.3417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2625 -1.6375 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 2 0 0 0 0
1 4 2 0 0 0 0
1 5 1 0 0 0 0
2 6 1 0 0 0 0
5 7 2 0 0 0 0
5 8 1 0 0 0 0
8 9 2 0 0 0 0
7 10 1 0 0 0 0
9 11 1 0 0 0 0
10 11 2 0 0 0 0
M END
> <mol_id>
MOL55310
$$$$
.........
And this is an example of the IDs file, file2:
MOL101103
MOL103108
MOL108108
I use awk:
awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1];next}$1 in a' file2 RS="$" file1.sdf
but the resulting output is not ordered. I need to extract the molecules from file1.sdf that correspond to the IDs in file2, in the same order as in file2, so that the output is an SDF like this:
MOL101103
-Chem-6789043209
12 12 0 0 0 0 0 0 0999 V2000
5.5667 -2.7625 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.3292 -2.7625 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.8292 -2.7917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.6292 -3.7167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.5542 -2.0042 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.4375 -2.1500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.4792 -3.4375 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.7667 -3.9167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.7417 -3.4542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6917 -2.1750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3500 -2.8292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.5917 -2.8417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
3 6 2 0 0 0 0
3 7 1 0 0 0 0
2 8 1 0 0 0 0
7 9 2 0 0 0 0
6 10 1 0 0 0 0
9 11 1 0 0 0 0
11 12 1 0 0 0 0
10 11 2 0 0 0 0
M END
> <mol_id>
MOL101103
$$$$
MOL103108
-Chem-6789005434
14 14 0 0 0 0 0 0 0999 V2000
5.9250 -2.8417 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
2.8875 -2.9292 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
6.6917 -2.8417 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.1667 -2.8750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6542 -2.9125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.9167 -3.6167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.9167 -2.0667 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7042 -3.9042 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
2.4042 -2.1500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.8167 -3.5292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.7792 -2.2167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0167 -2.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0542 -3.5542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0125 -3.7792 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2 5 1 0 0 0 0
1 3 1 0 0 0 0
1 4 1 0 0 0 0
5 12 2 0 0 0 0
1 6 2 0 0 0 0
1 7 2 0 0 0 0
2 8 1 0 0 0 0
2 9 2 0 0 0 0
4 10 1 0 0 0 0
4 11 2 0 0 0 0
11 12 1 0 0 0 0
10 13 2 0 0 0 0
3 14 1 0 0 0 0
5 13 1 0 0 0 0
M CHG 2 2 1 8 -1
M END
> <mol_id>
MOL103108
$$$$
MOL108108
-Chem-8567890432
12 12 0 0 0 0 0 0 0999 V2000
5.8875 -2.8500 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.6500 -2.8500 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.1542 -2.8750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.8792 -3.7292 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.8792 -2.0875 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.7542 -2.2292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8000 -3.5167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6667 -2.9125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.9417 -3.8125 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.9125 -2.9292 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.0167 -2.2625 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0667 -3.5417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
3 6 2 0 0 0 0
3 7 1 0 0 0 0
8 12 1 0 0 0 0
2 9 1 0 0 0 0
8 10 1 0 0 0 0
6 11 1 0 0 0 0
7 12 2 0 0 0 0
8 11 2 0 0 0 0
M END
> <mol_id>
MOL108108
$$$$
......
So the first molecule of the output file is the first molecule of the ID file, and so on. Thank you!
I couldn't figure out your input format or where several of the data items in your output were coming from, but this is a general way to do what you want, i.e. print records from file1 in the order of their IDs in file2:
$ cat tst.awk
NR==FNR {
idSet[$0]
idOrder[++numIds] = $0
next
}
$1 in idSet { id = $1 }
$1 !~ /^[0-9.]+$/ {
rec[id] = rec[id] $0 ORS
}
END {
for (idNr=1; idNr<=numIds; idNr++) {
id = idOrder[idNr]
if (id in rec) {
print rec[id]
}
}
}
$ awk -f tst.awk file2 file1
MOL108108
-Chem-8567890432
M END
> <mol_id>
MOL108108
$$$$
MOL450987
[…]
M END
> <mol_id>
MOL450987
$$$$
Massage to suit.
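If the whole record is wanted, atom and bond lines included, one possible variation is to buffer each molecule until its closing $$$$ line and only then file it under its ID. The following is a minimal sketch, not a drop-in replacement for the script above. It assumes the molecule name is the first field of the first line of each record, that every record ends with a line that is exactly $$$$, and that the files use Unix line endings. The function name extract_sdf is made up for illustration.

```shell
# Hypothetical helper: extract_sdf IDS_FILE SDF_FILE
# Prints the SDF records whose name is listed in IDS_FILE, in IDS_FILE order.
extract_sdf() {
    awk '
        # First file: remember each wanted ID and the order it appears in.
        NR == FNR {
            if ($1 != "") { want[$1]; order[++n] = $1 }
            next
        }
        # Second file: buffer the lines of the current record.
        {
            if (buf == "") id = $1      # first line of a record holds the name
            buf = buf $0 "\n"
        }
        # A "$$$$" line closes the record: keep it if wanted, then start over.
        $0 == "$$$$" {
            if (id in want) rec[id] = buf
            buf = ""
        }
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in rec)
                    printf "%s", rec[order[i]]
        }
    ' "$1" "$2"
}
```

Usage would look like extract_sdf file2 file1.sdf > ordered.sdf. IDs that are listed but missing from the SDF are skipped silently, and only the wanted records are held in memory.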
Adapting your original awk:
awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
(NR==FNR){a[$1]=$0; next}
($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt
The idea is to read the SDF file into memory, record by record.
The record separator is $$$$. You can set this in GNU awk as RS="\\$\\$\\$\\$". Here you need to escape the $ because it has a special meaning in a regex (it anchors to the end). There is a double escape going on: first the lexical parser of awk converts \\$ into \$, which is then the properly escaped $ for the regex.
The output record separator (the one used when printing records) is just ORS="$$$$". Here we do not need to escape it, as it is a normal string.
For the first file (NR==FNR), we store the full record $0 in an array indexed by the first field, the molecule name (a[$1]=$0).
The second file has the normal record separator, a newline (RS="\n"). So every time we read a record, we check whether it is an index of a, and if so, print the stored record.
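To see the escaping in isolation, a tiny check like the one below can help. It is only a sketch: it assumes an awk that treats a multi-character RS as a regular expression (GNU awk and mawk do, POSIX leaves it unspecified), and the function name split_records is made up. It also appends \n to the separator so that the newline following $$$$ does not end up at the start of the next record, which is a small variation on the command above rather than what that command does.

```shell
# Hypothetical helper: report the first field of every $$$$-terminated record.
split_records() {
    awk 'BEGIN { RS = "\\$\\$\\$\\$\n" }   # awk string -> regex \$\$\$\$ plus newline
         { printf "record %d starts with: %s\n", NR, $1 }' "$@"
}
```

Feeding it printf 'A\n 1 2\n$$$$\nB\n$$$$\n' should report two records, starting with A and B respectively.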