Is there code to extract complete molecular records from a large SDF file using a list of IDs?
I am using Python 3.7 in Spyder.
I am trying to extract complete molecular records from a large SDF file, using a short list of IDs stored in a txt file, and to write them into one new SDF file.
More specifically, I have a list of about 500 molecule IDs, one per line (each ID is ten characters long), whose molecular details are contained in a large SDF file of about 2 GB (300,000 molecules; each record spans about 400 lines from its ID line to the closing $$$$ line).
I need to extract the complete records for those 500 IDs from the 2 GB file into a single SDF file for further study.
I tried several similar, partial Python scripts from Stack Overflow and Google, but none of them worked. Could anyone give a hint or a few lines of code to test?
Thank you, Julio
As suggested (thank you Andrej: great idea), to simplify the problem I made small samples of the files. In the originals, lines are separated by \n. I added positional information to each record to make the results easier to follow.
f1.txt contains 3 IDs.
f2.sdf contains a simplified sample of the large 2 GB database.
f3.sdf contains the desired output, in this case the records for the 3 IDs.
f1.txt
SN00061212
SN00134795
SN00107686
f2.sdf
SN00039109
MOLSOFT 05232012283D, 1 in the large sdf list
about 400 lines more of code
$$$$
SN00357061
MOLSOFT 05232012283D, 2 in the large sdf list,
about 400 lines more of code
$$$$
SN00134795
MOLSOFT 05232012283D, 3 in the large sdf list
about 400 lines more of code
$$$$
SN00061212
MOLSOFT 05232012283D, 4 in the large sdf list
about 400 lines more of code, one in the short txt list SN1
$$$$
SN00134796
MOLSOFT 05232012283D, 5 in the large sdf list
about 400 lines more of code
$$$$
SN00134795
MOLSOFT 05232012283D, 6 in the large sdf list
about 400 lines more of code, one in the short txt list SN2
$$$$
SN00333333
MOLSOFT 05232012283D, 7 in the large sdf list
about 400 lines more of code
$$$$
SN00145791
MOLSOFT 05232012283D, 8 in the large sdf list
about 400 lines more of code
$$$$
SN00107686
MOLSOFT 05232012283D, 9 in the large sdf list
about 400 lines more of code, one in the short txt list SN3
$$$$
f3.sdf
SN00061212
MOLSOFT 05232012283D, 4 in the large sdf list
about 400 lines more of code, one in the short txt list SN1
$$$$
SN00134795
MOLSOFT 05232012283D, 6 in the large sdf list
about 400 lines more of code, one in the short txt list SN2
$$$$
SN00107686
MOLSOFT 05232012283D, 9 in the large sdf list
about 400 lines more of code, one in the short txt list SN3
$$$$
You can use the re module for the task.
If f1.txt contains:
SN00061212
SN00134795
SN00107686
and f2.sdf contains:
SN00039109
MOLSOFT 05232012283D
about 400 lines more of code
$$$$
SN00357061
MOLSOFT 05232012283D
about 400 lines more of code
$$$$
SN00061212
MOLSOFT 05232012283D
about 400 lines more of code, one in the short txt list SN1
$$$$
SN00134796
MOLSOFT 05232012283D
about 400 lines more of code
$$$$
SN00134795
MOLSOFT 05232012283D
about 400 lines more of code, one in the short txt list SN2
$$$$
SN00333333
MOLSOFT 05232012283D
about 400 lines more of code
$$$$
SN00145791
MOLSOFT 05232012283D
about 400 lines more of code
$$$$
SN00107686
MOLSOFT 05232012283D
about 400 lines more of code, one in the short txt list SN3
$$$$
Then this script:
import re

# Read the wanted IDs, skipping blank lines.
with open('f1.txt', 'r') as f_in:
    desired_ids = set(line.strip() for line in f_in if line.strip())

# Match from a line starting with one of the IDs up to the next $$$$ line.
expr = r'({}.*?^\s*\$\$\$\$)'.format(r'^\s*(?:' + r'|'.join(re.escape(i) for i in desired_ids) + r')')
r = re.compile(expr, flags=re.DOTALL|re.M)

# Note: f_in.read() loads the whole file into memory.
with open('f2.sdf', 'r') as f_in, open('f3.sdf', 'w') as f_out:
    for m in r.finditer(f_in.read()):
        print(m.group(0), file=f_out)
Produces f3.sdf:
SN00061212
MOLSOFT 05232012283D
about 400 lines more of code, one in the short txt list SN1
$$$$
SN00134795
MOLSOFT 05232012283D
about 400 lines more of code, one in the short txt list SN2
$$$$
SN00107686
MOLSOFT 05232012283D
about 400 lines more of code, one in the short txt list SN3
$$$$
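The script above reads the whole of f2.sdf into memory with f_in.read(), which may be heavy for a file of about 2 GB. As an alternative, here is a minimal sketch that streams the file line by line and keeps only one record in memory at a time. It assumes that the ID is the first line of each record and that every record ends with a $$$$ line, as in the samples; the function name extract_records is made up for this illustration.

```python
def extract_records(ids_path, sdf_path, out_path):
    """Copy every record whose first line is a wanted ID; return the count."""
    # Read the wanted IDs into a set for fast membership tests.
    with open(ids_path, 'r') as f_ids:
        wanted = {line.strip() for line in f_ids if line.strip()}

    written = 0
    record = []  # lines of the record currently being read
    with open(sdf_path, 'r') as f_in, open(out_path, 'w') as f_out:
        for line in f_in:
            record.append(line)
            if line.strip() == '$$$$':  # end-of-record delimiter
                # Assumption: the first line of the record holds the ID.
                if record[0].strip() in wanted:
                    f_out.writelines(record)
                    written += 1
                record = []  # start collecting the next record
    return written
```

Calling extract_records('f1.txt', 'f2.sdf', 'f3.sdf') should write the matching records in the order they appear in f2.sdf. A record is copied each time its ID occurs, so an ID that appears twice in the large file yields two records in the output. Trailing lines that are not closed by $$$$ are ignored.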
EDIT:
You can see the regex live on regex101.
The re.DOTALL flag means that the dot . character also matches newlines. The re.M (or re.MULTILINE) flag means that the ^ character matches at the beginning of each line, not just at the beginning of the string. More in the official re documentation.
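To see what the two flags change, here is a small sketch on a made-up two-record string (the record names A and B are placeholders, not real IDs). The same pattern is run once without flags and once with re.DOTALL and re.M.

```python
import re

# Two tiny made-up records, each closed by a $$$$ line.
text = "A\nfoo\n$$$$\nB\nbar\n$$$$\n"
pattern = r'^B.*?^\$\$\$\$'

# Without flags, '^' only matches at the very start of the string
# and '.' does not cross newlines, so record B is not found.
no_flags = re.findall(pattern, text)

# With both flags, '^' matches at the start of every line and '.'
# also matches newlines, so the whole record B is captured.
with_flags = re.findall(pattern, text, flags=re.DOTALL | re.M)

print(no_flags)    # []
print(with_flags)  # ['B\nbar\n$$$$']
```

The lazy .*? matters as well: it stops at the first $$$$ line after the ID, so a match does not run on into the following records.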