
How to collect and gather words from multiple lines in an input text (PDB file) using Python?

I am writing a Python script that collects words from a text file (a PDB file) and then combines them into phrases. However, as I am just a beginner in programming, I am having great difficulty doing it. I only know how to handle one line at a time. I would appreciate any help.

The text has information about the sites of a protein. Each site has four dedicated lines of information, as you can see below:

REMARK 800  
REMARK 800 SITE_IDENTIFIER: CC1                                                 
REMARK 800 EVIDENCE_CODE: SOFTWARE                                              
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE EDO A 326                 
REMARK 800                                                                      
REMARK 800 SITE_IDENTIFIER: DF8                                                 
REMARK 800 EVIDENCE_CODE: AUTHOR                                             
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE HEM T 238
REMARK 800                                                                      
REMARK 800 SITE_IDENTIFIER: FC7                                                 
REMARK 800 EVIDENCE_CODE: SOFTWARE                                              
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE NAG D 1001 

#and so on ...

An extended example can be seen at the following link (search for "REMARK 800"): http://www.pdb.org/pdb/files/3HDL.pdb

As observed,

  • The 1st line is empty. (It just separates one entry from the next.)
  • The 2nd has the SITE_IDENTIFIER (e.g. CC1).
  • The 3rd has the EVIDENCE_CODE (e.g. SOFTWARE).
  • The 4th has some RESIDUE information (e.g. EDO A 326).

This pattern repeats through a large part of the text.

What I want to do is gather some words from three of the four consecutive lines and put them together in one phrase. The necessary pieces of information are the SITE_IDENTIFIER, the EVIDENCE_CODE, and the last 3 words of the SITE_DESCRIPTION. Thus, for the text excerpt above, the resulting phrases would be something like this:

CC1 SOFTWARE EDO A 326
DF8 AUTHOR HEM T 238
FC7 SOFTWARE NAG D 1001

#and so on...

Is this possible? If so, how could I do it?

I tried doing it this way, but I feel like it is not going to work at all:

name_file = "3HDL.pdb"

pdb_file = open(name_file,"r")

for line in pdb_file:
    list = line.split()

    list_2=[]
    for j in range(0, 15):
        list_2.append("")

    if (list[0] == "REMARK" and list[1] == "800"):
        j=0
        while not j == len(list):
            list_2[j] = list[j]
            j+=1

        n=1
        if(list_2[0] == "REMARK" and list_2[1] == "800" and list_2[2] == "SITE_IDENTIFIER:"):
            n+=1
            print("Site", str(n) + ":", list_2[3])
            print("ok" + "\n")

As you can see, I am really a beginner.

Sorry about any grammar problems and thank you very much.

How about something like this:

import re

# Python 3 version; the with block closes the file automatically.
with open("3HDL.pdb", "r") as f:
    for line in f:
        m = re.search(r"REMARK 800 SITE_IDENTIFIER: (.+)", line)
        if m:
            site_id = m.group(1).strip()
            continue
        m = re.search(r"REMARK 800 EVIDENCE_CODE: (.+)", line)
        if m:
            evidence_code = m.group(1).strip()
            continue
        m = re.search(r"REMARK 800 SITE_DESCRIPTION: (.+)", line)
        if m:
            # Keep only the last three words of the description (e.g. EDO A 326).
            site_descrip = " ".join(m.group(1).split()[-3:])
            print(site_id, evidence_code, site_descrip)
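
The same idea can be folded into a single pattern. The following is only a sketch: the names FIELD_RE and iter_sites are made up for illustration, and it assumes every SITE_DESCRIPTION line ends with the three words wanted, as in the excerpt from the question.

```python
import re

# One pattern for the three field lines; group 1 is the field name,
# group 2 the value with trailing whitespace left out.
FIELD_RE = re.compile(
    r"^REMARK 800 (SITE_IDENTIFIER|EVIDENCE_CODE|SITE_DESCRIPTION):\s*(.*\S)"
)

def iter_sites(lines):
    """Yield (site_id, evidence_code, residue) for each site found in lines."""
    current = {}
    for line in lines:
        m = FIELD_RE.match(line)
        if not m:
            continue  # separator lines and unrelated records are skipped
        key, value = m.groups()
        current[key] = value
        if key == "SITE_DESCRIPTION":
            # Assumption: the last three words are the residue information.
            residue = " ".join(value.split()[-3:])
            yield (current.get("SITE_IDENTIFIER", "?"),
                   current.get("EVIDENCE_CODE", "?"),
                   residue)
            current = {}
```

Because iter_sites accepts any iterable of lines, it can be fed an open file object or a plain list of strings, which makes it easy to try on a short excerpt before running it on the whole file.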

Or, if you want to avoid using the regex module:

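# Python 3 version; the with block closes the file automatically.
with open("3HDL.pdb", "r") as f:
    for line in f:
        if line.startswith("REMARK 800"):
            # The field name starts at index 11, right after "REMARK 800 ".
            if line.startswith("SITE_IDENTIFIER:", 11):
                site_id = line[28:].rstrip()
            elif line.startswith("EVIDENCE_CODE:", 11):
                evidence_code = line[26:].rstrip()
            elif line.startswith("SITE_DESCRIPTION:", 11):
                # Keep only the last three words of the description (e.g. EDO A 326).
                site_descrip = " ".join(line[29:].split()[-3:])
                print(site_id, evidence_code, site_descrip)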
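
The slice offsets 28, 26 and 29 used above are simply the length of "REMARK 800 " plus the length of the field name plus one space. A small sketch that derives them instead of hard-coding them (PREFIX, FIELDS and field_value are illustrative names, not part of any library):

```python
PREFIX = "REMARK 800 "
FIELDS = ("SITE_IDENTIFIER:", "EVIDENCE_CODE:", "SITE_DESCRIPTION:")

# Index where each value starts: prefix + field name + one separating space.
offsets = {field: len(PREFIX) + len(field) + 1 for field in FIELDS}

def field_value(line):
    """Return (field_name, value) for a REMARK 800 field line, else None."""
    if not line.startswith(PREFIX):
        return None
    for field in FIELDS:
        if line.startswith(field, len(PREFIX)):
            start = len(PREFIX) + len(field)
            return field.rstrip(":"), line[start:].strip()
    return None
```

Deriving the offsets this way keeps the code correct if another field name is added to FIELDS later.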

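Here we assume that the wanted content is the last word of lines 2 and 3 and the last three words of line 4 of each group, and that the REMARK 800 lines follow exactly the four-line pattern shown in the question. Any extra line in that section would shift the pattern.

name_file = "3HDL.pdb"
output = []
with open(name_file, "r") as pdb_file:
    # Count only the REMARK 800 lines so that the four-line pattern lines up.
    remark_lines = [line for line in pdb_file if line.startswith("REMARK 800")]
for linenum, line in enumerate(remark_lines):
    if linenum % 4 == 0:
        continue  # separator line
    elif linenum % 4 == 3:
        output.extend(line.split()[-3:])
    else:
        output.append(line.split()[-1])
# Each site contributes five words: identifier, evidence code, three residue words.
for i in range(0, len(output), 5):
    print(' '.join(output[i:i+5]))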