
Extract molecules in order from SDF file according to IDs given in another file

I have an SD file containing thousands of molecules, and I need to extract molecules from it according to their IDs, which are given in a simple one-column file. An example of the SDF, file1.sdf:

MOL108108
  -Chem-8567890432

 15 15  0     0  0  0  0  0  0999 V2000
    6.1792   -2.6875    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.9542   -2.6875    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.4125   -2.7167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.1667   -3.4667    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.1667   -1.9000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.7375   -3.4625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.1000   -2.7667    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    3.1500   -4.1292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.0542   -3.3792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0167   -2.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.8792   -2.7542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.2542   -3.7125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.2500   -2.0792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2875   -3.4042    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9542   -3.4875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  6  7  1  0  0  0  0
  7 11  1  0  0  0  0
  6  8  2  0  0  0  0
  3  9  1  0  0  0  0
  3 10  2  0  0  0  0
 11 13  2  0  0  0  0
  2 12  1  0  0  0  0
 10 13  1  0  0  0  0
  9 14  2  0  0  0  0
  6 15  1  0  0  0  0
 11 14  1  0  0  0  0
M  END
> <mol_id>
MOL108108

$$$$
MOL16520
  -Chem4051902312

 22 21  0     1  0  0  0  0  0999 V2000
    0.2750    0.1500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2458   -0.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7917   -0.1500    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    1.3167    0.1458    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7625    0.1583    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   -1.8083    0.1583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7917   -0.7500    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   -1.2833   -0.1417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2458   -0.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3167    0.7458    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7625    0.7583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.8000    0.7583    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.8292   -0.1542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3208   -0.1417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.8375    0.1583    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6083    1.3333    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    1.3125   -1.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7875   -1.3500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2750   -1.0500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3542    0.1458    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0375    1.7583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0333    1.4875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  3  4  1  0  0  0  0
  2  5  1  0  0  0  0
  6  8  1  0  0  0  0
  3  7  1  6  0  0  0
  5  8  1  0  0  0  0
  2  9  2  0  0  0  0
  4 10  2  0  0  0  0
  5 11  1  1  0  0  0
  6 12  2  0  0  0  0
  4 13  1  0  0  0  0
  6 14  1  0  0  0  0
 14 15  1  0  0  0  0
 11 16  1  0  0  0  0
  7 17  1  0  0  0  0
  7 18  1  0  0  0  0
  7 19  1  0  0  0  0
 13 20  1  0  0  0  0
 16 21  1  0  0  0  0
 16 22  1  0  0  0  0
M  END
> <mol_id>
MOL16520

$$$$
MOL55310
  -Chem04051902312

 11 11  0     0  0  0  0  0  0999 V2000
    6.7292   -1.5750    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    7.5542   -1.5750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.7250   -2.4000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.7292   -0.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.9125   -1.5917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.9667   -0.8542    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.5167   -2.3292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4792   -0.8917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.6542   -0.9167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.6917   -2.3417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2625   -1.6375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  2  0  0  0  0
  1  4  2  0  0  0  0
  1  5  1  0  0  0  0
  2  6  1  0  0  0  0
  5  7  2  0  0  0  0
  5  8  1  0  0  0  0
  8  9  2  0  0  0  0
  7 10  1  0  0  0  0
  9 11  1  0  0  0  0
 10 11  2  0  0  0  0
M  END
> <mol_id>
MOL55310

$$$$

.........

And this is an example of the IDs file, file2:

MOL101103
MOL103108
MOL108108

I use awk:

awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1];next}$1 in a' file2 RS="$" file1.sdf

but the resulting output is not ordered. I need to extract the molecules from file1.sdf that correspond to the IDs in file2, in the same order as in file2, so that the output is an SDF like this:

MOL101103
  -Chem-6789043209

12 12  0     0  0  0  0  0  0999 V2000
    5.5667   -2.7625    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.3292   -2.7625    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.8292   -2.7917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6292   -3.7167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.5542   -2.0042    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.4375   -2.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.4792   -3.4375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7667   -3.9167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7417   -3.4542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6917   -2.1750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.3500   -2.8292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5917   -2.8417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  3  6  2  0  0  0  0
  3  7  1  0  0  0  0
  2  8  1  0  0  0  0
  7  9  2  0  0  0  0
  6 10  1  0  0  0  0
  9 11  1  0  0  0  0
 11 12  1  0  0  0  0
 10 11  2  0  0  0  0
M  END
> <mol_id>
MOL101103

$$$$
MOL103108
  -Chem-6789005434

14 14  0     0  0  0  0  0  0999 V2000
    5.9250   -2.8417    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    2.8875   -2.9292    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    6.6917   -2.8417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1667   -2.8750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6542   -2.9125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.9167   -3.6167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.9167   -2.0667    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.7042   -3.9042    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    2.4042   -2.1500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.8167   -3.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.7792   -2.2167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0167   -2.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0542   -3.5542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.0125   -3.7792    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  5  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  5 12  2  0  0  0  0
  1  6  2  0  0  0  0
  1  7  2  0  0  0  0
  2  8  1  0  0  0  0
  2  9  2  0  0  0  0
  4 10  1  0  0  0  0
  4 11  2  0  0  0  0
 11 12  1  0  0  0  0
 10 13  2  0  0  0  0
  3 14  1  0  0  0  0
  5 13  1  0  0  0  0
M  CHG  2   2   1   8  -1
M  END
> <mol_id>
MOL103108

$$$$
MOL108108
  -Chem-8567890432

12 12  0     0  0  0  0  0  0999 V2000
    5.8875   -2.8500    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.6500   -2.8500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1542   -2.8750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.8792   -3.7292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.8792   -2.0875    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.7542   -2.2292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8000   -3.5167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6667   -2.9125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.9417   -3.8125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.9125   -2.9292    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.0167   -2.2625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0667   -3.5417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  3  6  2  0  0  0  0
  3  7  1  0  0  0  0
  8 12  1  0  0  0  0
  2  9  1  0  0  0  0
  8 10  1  0  0  0  0
  6 11  1  0  0  0  0
  7 12  2  0  0  0  0
  8 11  2  0  0  0  0
M  END
> <mol_id>
MOL108108

$$$$

......

So the first molecule of the output file is the first molecule of the ID file, and so on. Thank you!

I couldn't figure out your input format or where several of the data items in your output were coming from, but this is a general way to do what you want, i.e. print records from file1 in the order in which their IDs appear in file2:

$ cat tst.awk
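NR==FNR {                       # first file (file2): the ID list
    idSet[$0]                   # set of wanted IDs, for fast lookup
    idOrder[++numIds] = $0      # remember the order the IDs were given in
    next
}
$1 in idSet { id = $1 }         # a wanted ID in the first field becomes the current ID
$1 !~ /^[0-9.]+$/ {             # skip lines whose first field is only digits and dots
    rec[id] = rec[id] $0 ORS    # append the line to the record of the current ID
}
END {
    for (idNr=1; idNr<=numIds; idNr++) {    # replay in the order of the ID list
        id = idOrder[idNr]
        if (id in rec) {
            print rec[id]
        }
    }
}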


$ awk -f tst.awk file2 file1
MOL108108
  -Chem-8567890432

M  END
> <mol_id>
MOL108108

$$$$
MOL450987
[…]
M  END
> <mol_id>
MOL450987

$$$$

Massage to suit.
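If the complete records are needed, including the atom and bond rows, a line-oriented variant of the same idea could look like the sketch below. It is not the script above, only an illustration under assumptions: every record starts with the molecule name on its first line and ends with a line that is exactly $$$$ (no trailing carriage return), and the sample data (MOL_A, MOL_B, MOL_C, MOL_X) is made up and heavily shortened.

```shell
#!/bin/sh
# Sketch: print SDF records in the order their IDs appear in an ID list.
# The sample records are hypothetical and contain no real atom or bond blocks.
tmp=$(mktemp -d)

cat > "$tmp/file1.sdf" <<'EOF'
MOL_A
  -Chem-1

M  END
> <mol_id>
MOL_A

$$$$
MOL_B
  -Chem-2

M  END
> <mol_id>
MOL_B

$$$$
MOL_C
  -Chem-3

M  END
> <mol_id>
MOL_C

$$$$
EOF

# ID list: MOL_X is deliberately absent from the SDF.
printf 'MOL_C\nMOL_X\nMOL_A\n' > "$tmp/file2"

ordered=$(awk '
    NR == FNR {                          # first file: the ID list
        if ($1 != "") { want[$1]; order[++n] = $1 }
        next
    }
    {
        if (++lineInRec == 1) name = $1  # first line of a record holds the ID
        buf = buf $0 "\n"                # keep every line, numeric or not
        if ($0 == "$$$$") {              # end of record: store it if requested
            if (name in want) rec[name] = buf
            buf = ""; lineInRec = 0
        }
    }
    END {                                # replay the records in ID-list order
        for (i = 1; i <= n; i++)
            if (order[i] in rec) printf "%s", rec[order[i]]
    }
' "$tmp/file2" "$tmp/file1.sdf")

printf '%s\n' "$ordered"
rm -rf "$tmp"
```

Only POSIX awk features are used here, so it should not depend on a regex RS. IDs that are missing from the SDF (MOL_X in the sample) are silently skipped.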

Adapting your original awk:

awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
     (NR==FNR){a[$1]=$0; next}
     ($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt

The idea is to read the SDF file into memory, record by record.

  • The record separator is $$$$. In GNU awk you can set this with RS="\\$\\$\\$\\$". The $ has to be escaped because it has a special meaning in a regular expression (it anchors to the end). Two levels of escaping are involved: awk's lexical analysis of the string first converts \\$ into \$, which the regex engine then reads as a literal $.

  • The output record separator (the one used when printing records) is just ORS="$$$$". It does not need to be escaped because it is used as a plain string, not as a regular expression.
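The escaping described in these two points can be checked on a tiny input. This is a minimal sketch, assuming an awk that treats a multi-character RS as a regular expression (GNU awk and mawk do; POSIX leaves the behaviour unspecified). The two records are made up.

```shell
#!/bin/sh
# Made-up two-record input; each record ends with a $$$$ line.
sample='MOL_A
M  END
$$$$
MOL_B
M  END
$$$$'

# "\\$" in an awk string is read as \$ by the lexer, and the regex
# engine then matches that as a literal dollar sign.
names=$(printf '%s\n' "$sample" | awk '
    BEGIN { RS = "\\$\\$\\$\\$" }
    NF { print $1 }        # NF skips the empty chunk after the last separator
')

printf '%s\n' "$names"
```

With the default FS, newlines inside a record count as field separators, which is why $1 is the molecule name even for records that begin with the newline left over from the previous $$$$ line.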

For the first file (NR==FNR), we store the full record $0 in an array indexed by the first field, the molecule name (a[$1]=$0).

The second file uses the normal record separator, a newline (RS="\n"). So every time we read a record from it, we check whether its first field is an index of a and, if so, print the stored record.
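One possible refinement, offered as a sketch and not as part of the answer above: if the newline that follows each $$$$ line is made part of both separators, the stored records no longer start with a leftover newline and every $$$$ in the output ends up on a line of its own. This again assumes an awk with regex RS support (such as GNU awk or mawk), Unix line endings, and a final $$$$ line that ends with a newline. The sample data is made up.

```shell
#!/bin/sh
# Sketch with hypothetical, shortened records.
tmp=$(mktemp -d)

cat > "$tmp/file1.sdf" <<'EOF'
MOL_A
M  END
$$$$
MOL_B
M  END
$$$$
EOF

printf 'MOL_B\nMOL_A\n' > "$tmp/file2.txt"

clean=$(awk '
    BEGIN { RS = "\\$\\$\\$\\$\n"; ORS = "$$$$\n" }  # newline is part of both separators
    NR == FNR { a[$1] = $0; next }                   # SDF: store each record by name
    ($1 in a) { print a[$1] }                        # ID list: print in the order given
' "$tmp/file1.sdf" RS='\n' "$tmp/file2.txt")

printf '%s\n' "$clean"
rm -rf "$tmp"
```

The RS='\n' assignment between the two file names is applied only when awk is about to open the second file, so the SDF is still read with the $$$$ separator.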
