How to collect and gather words from multiple lines in an input text (PDB file) using Python?

I am writing a Python script that collects words from a text file (a PDB file) and then gathers them into phrases. However, as I am just a beginner in programming, I'm having great difficulty doing it. I only know how to process one line at a time. I would appreciate any help.

The text has information about the sites of a protein. Each site has four dedicated lines of information, as you can see below:

REMARK 800  
REMARK 800 SITE_IDENTIFIER: CC1                                                 
REMARK 800 EVIDENCE_CODE: SOFTWARE                                              
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE EDO A 326                 
REMARK 800                                                                      
REMARK 800 SITE_IDENTIFIER: DF8                                                 
REMARK 800 EVIDENCE_CODE: AUTHOR                                             
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE HEM T 238
REMARK 800                                                                      
REMARK 800 SITE_IDENTIFIER: FC7                                                 
REMARK 800 EVIDENCE_CODE: SOFTWARE                                              
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE NAG D 1001 

#and so on ...

An extended example can be seen at the following link (search for "REMARK 800"): http://www.pdb.org/pdb/files/3HDL.pdb

As observed:

  • The 1st line has nothing. (It just separates one entry from the next.)
  • The 2nd has the SITE_IDENTIFIER (e.g. CC1).
  • The 3rd has the EVIDENCE_CODE (e.g. SOFTWARE).
  • The 4th has some RESIDUE information (e.g. EDO A 326).

This pattern is seen in a large part of the text.

What I want to do is gather some words from three of the four consecutive dedicated lines and put them together in one phrase. The necessary information is the SITE_IDENTIFIER, the EVIDENCE_CODE, and 3 words from the SITE_DESCRIPTION. Thus, for the text excerpt above, the resulting phrases would be something like this:

CC1 SOFTWARE EDO A 326
DF8 AUTHOR HEM T 238
FC7 SOFTWARE NAG D 1001

#and so on...

Is this possible? If so, how can I do it?

I tried doing it this way, but I feel like it is not going to work at all:

name_file = "3HDL.pdb"

pdb_file = open(name_file,"r")

for line in pdb_file:
    list = line.split()

    list_2=[]
    for j in range(0, 15):
        list_2.append("")

    if (list[0] == "REMARK" and list[1] == "800"):
        j=0
        while not j == len(list):
            list_2[j] = list[j]
            j+=1

        n=1
        if(list_2[0] == "REMARK" and list_2[1] == "800" and list_2[2] == "SITE_IDENTIFIER:"):
            n+=1
            print("Site", str(n) + ":", list_2[3])
            print("ok" + "\n")

As you can see, I am really a beginner.

Sorry about any grammar problems, and thank you very much.

How about something like this:

import re

# The with-statement closes the file automatically.
with open("3HDL.pdb", "r") as f:
    for line in f:
        m = re.search(r"REMARK 800 SITE_IDENTIFIER: (.+)", line)
        if m:
            site_id = m.group(1).strip()
        else:
            m = re.search(r"REMARK 800 EVIDENCE_CODE: (.+)", line)
            if m:
                evidence_code = m.group(1).strip()
            else:
                m = re.search(r"REMARK 800 SITE_DESCRIPTION: (.+)", line)
                if m:
                    # Holds the full description text, not only the residue words.
                    site_descrip = m.group(1).strip()
                    # print() as a function, matching the Python 3 style in the question.
                    print(site_id, evidence_code, site_descrip)
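
The regex version above prints the whole SITE_DESCRIPTION text, while the question asks only for the three residue words. A minimal sketch of one way to get exactly the requested phrases follows. It assumes the residue is always the last three words of the description, and `collect_sites` and `FIELD_RE` are hypothetical names chosen for illustration, not anything from the original answer.

```python
import re

# One pattern covering the three labelled REMARK 800 lines.
FIELD_RE = re.compile(
    r"^REMARK 800 (SITE_IDENTIFIER|EVIDENCE_CODE|SITE_DESCRIPTION):\s*(.*)$"
)


def collect_sites(lines):
    """Return one phrase per site, e.g. 'CC1 SOFTWARE EDO A 326'."""
    phrases = []
    current = {}
    for line in lines:
        m = FIELD_RE.match(line.rstrip())
        if not m:
            continue  # separator lines and unrelated records are skipped
        label, value = m.group(1), m.group(2).strip()
        current[label] = value
        if label == "SITE_DESCRIPTION" and len(current) == 3:
            # Assumption: the residue is the last three words of the description.
            residue = value.split()[-3:]
            phrases.append(" ".join(
                [current["SITE_IDENTIFIER"], current["EVIDENCE_CODE"]] + residue
            ))
            current = {}
    return phrases


sample = [
    "REMARK 800  ",
    "REMARK 800 SITE_IDENTIFIER: CC1          ",
    "REMARK 800 EVIDENCE_CODE: SOFTWARE       ",
    "REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE EDO A 326  ",
    "REMARK 800  ",
    "REMARK 800 SITE_IDENTIFIER: DF8          ",
    "REMARK 800 EVIDENCE_CODE: AUTHOR         ",
    "REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE HEM T 238",
]
for phrase in collect_sites(sample):
    print(phrase)
```

Because the function only iterates over lines, an open file object can be passed in place of the `sample` list.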

Or, if you want to avoid using the regex module:

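# The with-statement closes the file automatically.
with open("3HDL.pdb", "r") as f:
    for line in f:
        if line.startswith("REMARK 800"):
            # The label starts at column 11, right after "REMARK 800 ".
            if line.startswith("SITE_IDENTIFIER:", 11):
                site_id = line[28:].rstrip()
            elif line.startswith("EVIDENCE_CODE:", 11):
                evidence_code = line[26:].rstrip()
            elif line.startswith("SITE_DESCRIPTION:", 11):
                site_descrip = line[29:].rstrip()
                # print() as a function, matching the Python 3 style in the question.
                print(site_id, evidence_code, site_descrip)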
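
The slice offsets 28, 26 and 29 above are the lengths of each label prefix plus one space. As a rough sketch, the offsets can be derived from the prefix text itself so there are no magic numbers to keep in sync. `PREFIXES` and `parse_field` are hypothetical names used only for this illustration.

```python
# Map a field name to the exact text that introduces it.
PREFIXES = {
    "site_id": "REMARK 800 SITE_IDENTIFIER:",
    "evidence_code": "REMARK 800 EVIDENCE_CODE:",
    "site_descrip": "REMARK 800 SITE_DESCRIPTION:",
}


def parse_field(line):
    """Return (name, value) for a labelled REMARK 800 line, otherwise None."""
    for name, prefix in PREFIXES.items():
        if line.startswith(prefix):
            # Slice right after the prefix; strip() removes the padding spaces.
            return name, line[len(prefix):].strip()
    return None


print(parse_field("REMARK 800 EVIDENCE_CODE: AUTHOR        \n"))
```

Any line that does not start with one of the three prefixes, including the bare "REMARK 800" separator, yields `None`.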

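Here we assume the wanted content is the last word of lines 2 and 3 and the last 3 words of line 4 in each four-line block. Because the line number alone decides how a line is handled, this also assumes the input contains only the REMARK 800 lines and starts with a bare "REMARK 800" separator.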

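name_file = "3HDL.pdb"
output = []
with open(name_file, "r") as pdb_file:
    for linenum, line in enumerate(pdb_file):
        if linenum % 4 == 0:
            continue  # bare "REMARK 800" separator
        elif linenum % 4 == 1:
            output.append(line.split()[-1])    # SITE_IDENTIFIER value
        elif linenum % 4 == 2:
            output.append(line.split()[-1])    # EVIDENCE_CODE value
        elif linenum % 4 == 3:
            output.extend(line.split()[-3:])   # three residue words
# Each site contributes 5 words (1 + 1 + 3), so step through the list in fives.
for i in range(0, len(output), 5):
    print(' '.join(output[i:i+5]))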
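
The line-number approach above only lines up when the input holds nothing but the four-line blocks. A possible variation, sketched here under the assumption that every site has exactly one SITE_IDENTIFIER, one EVIDENCE_CODE and one single-line SITE_DESCRIPTION in that order, is to filter the labelled lines first and then group them in threes. `LABELS` and `group_sites` are hypothetical names for illustration.

```python
LABELS = {"SITE_IDENTIFIER:", "EVIDENCE_CODE:", "SITE_DESCRIPTION:"}


def group_sites(lines):
    """Filter the labelled REMARK 800 lines, then group them three at a time."""
    fields = []
    for line in lines:
        tokens = line.split()
        # Keep only 'REMARK 800 <LABEL>: ...' lines; everything else is skipped.
        if len(tokens) > 3 and tokens[:2] == ["REMARK", "800"] and tokens[2] in LABELS:
            fields.append(tokens)
    phrases = []
    # Assumption: labelled lines arrive as identifier, evidence, description.
    for i in range(0, len(fields) - 2, 3):
        site_id, evidence, descrip = fields[i:i + 3]
        phrases.append(" ".join([site_id[-1], evidence[-1]] + descrip[-3:]))
    return phrases


sample = [
    "HEADER    SOME OTHER RECORD",
    "REMARK 800  ",
    "REMARK 800 SITE_IDENTIFIER: FC7          ",
    "REMARK 800 EVIDENCE_CODE: SOFTWARE       ",
    "REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE NAG D 1001 ",
]
print(group_sites(sample))
```

Filtering before grouping means unrelated records elsewhere in the file no longer shift the positions.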
