
Is there code to extract complete molecular records from a large SDF file, given a list of IDs?

I am using Python 3.7 in Spyder.

I am trying to extract complete molecular records from a large SDF file, using a short list of IDs stored in a txt file, and to write them into one new SDF file.

More specifically, I have a selected list of about 500 chemical molecule IDs, one ID per line (ten characters each). Their molecular details are contained in a large SDF file of about 2 GB (300000 molecules; each record has about 400 lines between its ID line and the final $$$$ line).

I need to extract the complete records for those 500 IDs from the large 2 GB SDF file into a single SDF file for further study.

I tried several similar, partial Python scripts from Stack Overflow and Google, but none of them worked. Could anyone give a hint or a few lines of code to test?

Thank you, Julio

As suggested (thank you, Andrej: great idea), I made small samples of the files to simplify the problem. In the originals, lines are separated by \n. I added each record's position in the large file to the record to make the results easier to follow. f1.txt contains 3 IDs, f2.sdf contains a simplified sample of the large 2 GB database, and f3.sdf contains the desired output, in this case for the 3 IDs.

f1.txt

SN00061212
SN00134795
SN00107686

f2.sdf

SN00039109
 MOLSOFT 05232012283D, 1 in the large sdf list

about 400 lines more of code

$$$$
SN00357061
 MOLSOFT 05232012283D, 2 in the large sdf list, 

about 400 lines more of code

$$$$
SN00134795
 MOLSOFT 05232012283D, 3 in the large sdf list

about 400 lines more of code

   $$$$
SN00061212
 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134796
 MOLSOFT 05232012283D, 5 in the large sdf list

about 400 lines more of code

  $$$$
SN00134795
 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00333333
 MOLSOFT 05232012283D, 7 in the large sdf list

about 400 lines more of code

  $$$$
SN00145791
  MOLSOFT 05232012283D, 8 in the large sdf list

about 400 lines more of code

  $$$$
SN00107686
 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3

$$$$ 

f3.sdf

SN00061212
 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134795
 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00107686
 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3

$$$$

You can use the re module for this task.

If f1.txt contains:

SN00061212
SN00134795
SN00107686

and f2.sdf contains:

SN00039109
 MOLSOFT 05232012283D

about 400 lines more of code

$$$$
SN00357061
 MOLSOFT 05232012283D

about 400 lines more of code

$$$$
SN00061212
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134796
 MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00134795
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00333333
 MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00145791
  MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00107686
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3

$$$$

Then this script:

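import re

# Read the wanted IDs, one per line, skipping blank lines.
with open('f1.txt', 'r') as f_in:
    desired_ids = set(line.strip() for line in f_in if line.strip())

# Match from a line that starts with one of the IDs up to the next "$$$$" line.
expr = r'({}.*?^\s*\$\$\$\$)'.format(r'^\s*(?:' + r'|'.join(re.escape(i) for i in desired_ids) + r')')
r = re.compile(expr, flags=re.DOTALL|re.M)

# Note: f_in.read() loads the whole SDF file into memory before matching.
with open('f2.sdf', 'r') as f_in, open('f3.sdf', 'w') as f_out:
    for m in r.finditer(f_in.read()):
        print(m.group(0), file=f_out)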

Produces f3.sdf:

SN00061212
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134795
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00107686
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3

$$$$
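For a file as large as the 2 GB one in the question, reading everything with f_in.read() may need a lot of memory. Below is a minimal sketch of a line-by-line alternative, not part of the original answer. It assumes that the ID is the first line of each record and that every record ends with a line containing only $$$$ (possibly indented). The function name extract_records and the tiny in-memory sample are hypothetical and only serve as illustration.

```python
import io


def extract_records(id_lines, sdf_lines, out):
    """Copy every record whose first line is a wanted ID to `out`.

    Assumption: the ID is the first line of a record and each record
    ends with a line that contains only "$$$$" (whitespace ignored).
    Returns the set of wanted IDs that were never found.
    """
    wanted = {line.strip() for line in id_lines if line.strip()}
    found = set()
    record = []  # lines of the record currently being read
    for line in sdf_lines:
        record.append(line)
        if line.strip() == '$$$$':  # end-of-record delimiter
            record_id = record[0].strip()
            if record_id in wanted:
                out.writelines(record)
                found.add(record_id)
            record = []  # start collecting the next record
    return wanted - found


# Small in-memory demonstration (hypothetical data, not the real files).
ids = io.StringIO("SN00061212\nSN00107686\nSN99999999\n")
sdf = io.StringIO(
    "SN00039109\n MOLSOFT\n\n$$$$\n"
    "SN00061212\n MOLSOFT\n\n  $$$$\n"
    "SN00107686\n MOLSOFT\n\n$$$$\n"
)
out = io.StringIO()
missing = extract_records(ids, sdf, out)
print(out.getvalue(), end="")
print("not found:", missing)
```

With real files, the three arguments would be the open file objects for f1.txt, f2.sdf and f3.sdf. Only one record is held in memory at a time, and the returned set shows which IDs from the list were not present in the large file.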

EDIT:

You can see the regex live on regex101.

re.DOTALL means that the dot (.) also matches newlines. re.M (or re.MULTILINE) means that ^ matches at the beginning of each line, not just at the beginning of the file. More in the official re documentation.
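The effect of the two flags can be checked on a tiny example. This is only an illustrative sketch: the two-record string and the single-letter IDs A and B are made up, and the pattern has the same shape as the one built in the script.

```python
import re

# Two miniature "records"; the first delimiter line is indented.
text = "A\nline\n  $$$$\nB\nline\n$$$$\n"
pattern = r'^(?:A|B).*?^\s*\$\$\$\$'

# Without flags: '.' cannot cross a newline and '^' only matches at the
# very start of the text, so no complete record can be matched.
plain = re.findall(pattern, text)

# With DOTALL and MULTILINE: '.' crosses newlines and '^' matches at the
# start of every line, so each record is matched up to its "$$$$" line.
flagged = re.findall(pattern, text, flags=re.DOTALL | re.M)

print(plain)
print(flagged)
```

The lazy quantifier .*? matters as well: it stops at the first $$$$ line after the ID, so a match never runs on into the next record.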


 