
Is there code to extract complete molecular records from a large SDF file, given a list of IDs?

Originally posted 2020-05-25 19:11:05. Tags: python-3.x, string, dictionary, for-loop, sdf

I am using Python 3.7 in Spyder.

I am trying to extract complete molecular records from a large SDF file, using a small list of IDs stored in a txt file, and write them into one new SDF file.

More specifically, I have a selected list of about 500 chemical molecule IDs, one ID per line (each ID is ten characters long). Their molecular details are contained in a large SDF file of about 2 GB (300,000 molecules; each record has about 400 lines between its ID and the final $$$$ line).

I need to extract the complete records for those 500 IDs from the large 2 GB SDF file into a single SDF file for further studies.

I tried similar, partial Python scripts found on Stack Overflow and Google, but none of them worked. Could anyone give a hint or a few lines of code to test?

Thank you, Julio

As suggested (thank you Andrej, great idea), to simplify the problem I made small samples of the files. In the originals, lines are separated by \n. I added positional information to each record to make the results easier to follow. f1.txt contains 3 IDs, f2.sdf contains a simplified sample of the large 2 GB database, and f3.sdf contains the desired output, in this case for the 3 IDs.

f1.txt

SN00061212
SN00134795
SN00107686

f2.sdf

SN00039109
 MOLSOFT 05232012283D, 1 in the large sdf list

about 400 lines more of code

$$$$
SN00357061
 MOLSOFT 05232012283D, 2 in the large sdf list,

about 400 lines more of code

$$$$
SN00134795
 MOLSOFT 05232012283D, 3 in the large sdf list

about 400 lines more of code

   $$$$
SN00061212
 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134796
 MOLSOFT 05232012283D, 5 in the large sdf list

about 400 lines more of code

  $$$$
SN00134795
 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00333333
 MOLSOFT 05232012283D, 7 in the large sdf list

about 400 lines more of code

  $$$$
SN00145791
  MOLSOFT 05232012283D, 8 in the large sdf list

about 400 lines more of code

  $$$$
SN00107686
 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3

$$$$

f3.sdf

SN00061212
 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134795
 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00107686
 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3

$$$$

1 Answer

You can use the re module for this task:

If f1.txt contains:

SN00061212
SN00134795
SN00107686

and f2.sdf contains:

SN00039109
 MOLSOFT 05232012283D

about 400 lines more of code

$$$$
SN00357061
 MOLSOFT 05232012283D

about 400 lines more of code

$$$$
SN00061212
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134796
 MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00134795
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00333333
 MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00145791
  MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00107686
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3

$$$$

Then this script:

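import re

# Collect the wanted IDs, skipping blank lines
with open('f1.txt', 'r') as f_in:
    desired_ids = set(line.strip() for line in f_in if line.strip())

# Match from a line starting with a wanted ID up to the next $$$$
# at the start of a line (leading whitespace allowed)
expr = r'({}.*?^\s*\$\$\$\$)'.format(r'^\s*(?:' + r'|'.join(re.escape(i) for i in desired_ids) + r')')
r = re.compile(expr, flags=re.DOTALL|re.M)

# f_in.read() loads the whole SDF file into memory before matching
with open('f2.sdf', 'r') as f_in, open('f3.sdf', 'w') as f_out:
    for m in r.finditer(f_in.read()):
        print(m.group(0), file=f_out)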

Produces f3.sdf:

SN00061212
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134795
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00107686
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3

$$$$
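
For a file of about 2 GB, f_in.read() holds the whole file in memory at once. If that turns out to be a problem, one possible alternative is to stream the file line by line and keep only the current record in memory. The sketch below is not part of the original answer. It assumes that the ID is always the first line of a record and that each record ends with a line containing only $$$$ (possibly indented). The function name extract_records and its parameters are hypothetical.

```python
import io


def extract_records(id_lines, sdf_lines, out):
    """Copy every record whose first line is a wanted ID from sdf_lines to out.

    id_lines and sdf_lines are any iterables of text lines (for example open
    files); out is any object with a writelines() method.
    Returns the number of records written.
    """
    wanted = {line.strip() for line in id_lines if line.strip()}
    record = []      # lines of the record currently being read
    written = 0
    for line in sdf_lines:
        record.append(line)
        if line.strip() == "$$$$":              # end-of-record delimiter
            # Assumption: the first line of the record is its ID
            if record[0].strip() in wanted:
                out.writelines(record)
                written += 1
            record = []                         # start collecting the next record
    return written


# Small in-memory demonstration with made-up records
ids = io.StringIO("SN00061212\nSN00107686\n")
sdf = io.StringIO(
    "SN00039109\n MOLSOFT\n$$$$\n"
    "SN00061212\n MOLSOFT\n  $$$$\n"
    "SN00107686\n MOLSOFT\n$$$$\n"
)
result = io.StringIO()
print(extract_records(ids, sdf, result), "records copied")
print(result.getvalue(), end="")
```

With real files, the same function could be called with open('f1.txt'), open('f2.sdf') and open('f3.sdf', 'w') inside a with statement. Because the ID is compared as a whole line against a set, an ID that is merely a prefix of a longer one does not match, and records are written in the order in which they appear in the large file.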

EDIT:

You can see the regex live on regex101.

re.DOTALL means that the dot (.) also matches newline characters. re.M (or re.MULTILINE) means that ^ matches at the beginning of every line, not just at the beginning of the file. More details are in the official re documentation.
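
The effect of the two flags can be checked on a tiny string. This is only an illustrative sketch: the sample text and the simplified pattern are made up for the demonstration and are shorter than the expression used in the script.

```python
import re

# Two miniature "records", each ending with a $$$$ line
text = "AAA\nline 1\n$$$$\nBBB\nline 2\n$$$$\n"

# Simplified pattern: a line starting with BBB, then anything (lazily),
# up to the next $$$$ at the start of a line
pattern = r'^BBB.*?^\$\$\$\$'

# Without flags, ^ matches only at the very start of the string and
# . does not cross newlines, so nothing is found
no_flags = re.search(pattern, text)
print(no_flags)

# With both flags, ^ matches at each line start and . spans newlines
m = re.search(pattern, text, flags=re.DOTALL | re.M)
print(repr(m.group(0)))
```

The lazy quantifier .*? matters as well: it stops at the first $$$$ after the ID, so a match covers one record instead of running on to the last delimiter in the file.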
