Is there code to extract complete molecular records from a large sdf file using a list of IDs?

I am using Python 3.7 in Spyder.

I am trying to extract complete molecular records from a large sdf file, using a small list of IDs stored in a txt file, and write them into one new sdf file.

More specifically, I have a selected list of about 500 chemical molecule IDs, one ID per line (each ID is ten digits), whose molecular details are contained in a large sdf file of about 2 GB (300000 molecules; each record has about 400 lines between its ID and the final $$$$ line).

I need to extract the complete records for those 500 IDs from the large 2 GB sdf file into a single sdf file for further study.

I tried several similar, partial Python scripts from Stack Overflow and Google, but none of them worked. Could anyone give a hint or a few lines of code to test?

Thank you, Julio

As suggested (thank you Andrej: great idea), to simplify the problem I made small samples of the files. In the originals, lines are separated by \n. I added positional information to each record to make the results easier to follow. f1.txt contains 3 IDs, f2.sdf contains a simplified sample of the large 2 GB database, and f3.sdf contains the desired output, in this case the records for the 3 IDs.

f1.txt


f2.sdf

 MOLSOFT 05232012283D, 1 in the large sdf list

about 400 lines more of code

 MOLSOFT 05232012283D, 2 in the large sdf list

about 400 lines more of code

 MOLSOFT 05232012283D, 3 in the large sdf list

about 400 lines more of code

 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

 MOLSOFT 05232012283D, 5 in the large sdf list

about 400 lines more of code

 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

 MOLSOFT 05232012283D, 7 in the large sdf list

about 400 lines more of code

  MOLSOFT 05232012283D, 8 in the large sdf list

about 400 lines more of code

 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3


f3.sdf

 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3


You can use the re module for this task.

If f1.txt contains:


f2.sdf contains:

 MOLSOFT 05232012283D

about 400 lines more of code

 MOLSOFT 05232012283D

about 400 lines more of code

 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

 MOLSOFT 05232012283D

about 400 lines more of code

 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

 MOLSOFT 05232012283D

about 400 lines more of code

  MOLSOFT 05232012283D

about 400 lines more of code

 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3


Then this script:

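import re

# Read the wanted IDs, one per line, skipping blank lines.
with open('f1.txt', 'r') as f_in:
    desired_ids = set(line.strip() for line in f_in if line.strip())

# Match from a line that starts with one of the IDs up to the closing $$$$ line.
expr = r'({}.*?^\s*\$\$\$\$)'.format(r'^\s*(?:' + r'|'.join(re.escape(i) for i in desired_ids) + r')')
r = re.compile(expr, flags=re.DOTALL|re.M)

# Note: f_in.read() loads the whole sdf file into memory.
with open('f2.sdf', 'r') as f_in, open('f3.sdf', 'w') as f_out:
    for m in r.finditer(f_in.read()):
        print(m.group(0), file=f_out)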

Produces f3.sdf:

 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3
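The script above reads the whole file in one call, which may be heavy for a file of about 2 GB. As a rough alternative, here is a minimal line-by-line sketch. It assumes that each record's ID is exactly the first line of the record (ignoring surrounding whitespace) and that each record ends with a $$$$ line, as described in the question; the function name extract_records is hypothetical. Unlike the regex, it compares the whole first line rather than a prefix.

```python
def extract_records(id_path, sdf_path, out_path):
    """Copy records whose first line is a wanted ID; return how many were written."""
    # Read the wanted IDs, one per line, skipping blank lines.
    with open(id_path) as f:
        wanted = {line.strip() for line in f if line.strip()}

    written = 0
    record = []  # lines of the record currently being read
    with open(sdf_path) as sdf, open(out_path, "w") as out:
        for line in sdf:
            record.append(line)
            if line.strip() == "$$$$":  # end of the current record
                # Assumption: the ID is the first line of the record.
                if record[0].strip() in wanted:
                    out.writelines(record)
                    written += 1
                record = []
    return written
```

With the sample files from the question it could be called as extract_records('f1.txt', 'f2.sdf', 'f3.sdf'). Only one record is held in memory at a time, so memory use should stay small regardless of the file size.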



You can see the regex live on regex101.

re.DOTALL means that the dot (.) character also matches newlines. re.M (or re.MULTILINE) means that the ^ character matches at the beginning of each line, not just at the beginning of the file. See the official re documentation for more.
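To illustrate the effect of the two flags, here is a small sketch on made-up data. The IDs A1 and B2 and the sample text are invented for the demonstration and are not taken from the real file.

```python
import re

# Two tiny fake records, each ending with a $$$$ line.
text = "A1\nline\n$$$$\nB2\nline\n$$$$\n"

# Without flags: '^' anchors only at the start of the string
# and '.' does not match newlines.
plain = re.compile(r'^B2.*?^\$\$\$\$')

# With re.M, '^' matches at the start of every line;
# with re.DOTALL, '.' also matches newlines.
flagged = re.compile(r'^B2.*?^\$\$\$\$', flags=re.DOTALL | re.M)

print(plain.search(text))                # None
print(repr(flagged.search(text).group(0)))
```

Without the flags the pattern cannot find the second record, because B2 is not at the very start of the text; with both flags the match spans from the B2 line through its $$$$ line.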
