I am writing a Python script that collects words from a text file (a PDB file) and later gathers them into phrases. However, as I am just a beginner in programming, I'm having great difficulty doing it. I only know how to do it one line at a time. I would appreciate any help.
The text has information about the sites of a protein. Each site has four dedicated lines of information, as you can see below:
REMARK 800
REMARK 800 SITE_IDENTIFIER: CC1
REMARK 800 EVIDENCE_CODE: SOFTWARE
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE EDO A 326
REMARK 800
REMARK 800 SITE_IDENTIFIER: DF8
REMARK 800 EVIDENCE_CODE: AUTHOR
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE HEM T 238
REMARK 800
REMARK 800 SITE_IDENTIFIER: FC7
REMARK 800 EVIDENCE_CODE: SOFTWARE
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE NAG D 1001
#and so on ...
An extended example can be seen at the following link (search for "REMARK 800"): http://www.pdb.org/pdb/files/3HDL.pdb
As observed, this pattern appears in a large part of the text.
What I want to do is gather some words from three of the four consecutive dedicated lines and put them together in one phrase. The necessary information is the SITE_IDENTIFIER, the EVIDENCE_CODE, and 3 words from SITE_DESCRIPTION. Thus, for the text excerpt above, the resulting phrases would be something like this:
CC1 SOFTWARE EDO A 326
DF8 AUTHOR HEM T 238
FC7 SOFTWARE NAG D 1001
#and so on...
Is this possible? If so, how could I do it?
I tried doing it this way, but I feel like it is not going to work at all:
name_file = "3HDL.pdb"
pdb_file = open(name_file,"r")
for line in pdb_file:
    list = line.split()
    list_2=[]
    for j in range(0, 15):
        list_2.append("")
    if (list[0] == "REMARK" and list[1] == "800"):
        j=0
        while not j == len(list):
            list_2[j] = list[j]
            j+=1
    n=1
    if(list_2[0] == "REMARK" and list_2[1] == "800" and list_2[2] == "SITE_IDENTIFIER:"):
        n+=1
        print("Site", str(n) + ":", list_2[3])
        print("ok" + "\n")
As you can see, I am really a beginner.
Sorry about any grammar problems and thank you very much.
How about something like this:
import re

with open("3HDL.pdb", "r") as f:
    for line in f:
        m = re.search(r"REMARK 800 SITE_IDENTIFIER: (.+)", line)
        if m:
            site_id = m.group(1).strip()
        else:
            m = re.search(r"REMARK 800 EVIDENCE_CODE: (.+)", line)
            if m:
                evidence_code = m.group(1).strip()
            else:
                m = re.search(r"REMARK 800 SITE_DESCRIPTION: (.+)", line)
                if m:
                    # The description is the last line of a site, so print here.
                    site_descrip = m.group(1).strip()
                    print(site_id, evidence_code, site_descrip)
Or, if you want to avoid using the regex module:
with open("3HDL.pdb", "r") as f:
    for line in f:
        if line.startswith("REMARK 800"):
            # The field name starts at index 11, right after "REMARK 800 ".
            if line.startswith("SITE_IDENTIFIER:", 11):
                site_id = line[28:].rstrip()
            elif line.startswith("EVIDENCE_CODE:", 11):
                evidence_code = line[26:].rstrip()
            elif line.startswith("SITE_DESCRIPTION:", 11):
                site_descrip = line[29:].rstrip()
                print(site_id, evidence_code, site_descrip)
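Note that both versions print the whole SITE_DESCRIPTION text, while the question asks for only three words of it. Here is a minimal sketch that keeps just the residue part, assuming it is always the last three words of the description (as in the excerpt from the question) and that each description fits on a single line. The function name `collect_sites` is made up for illustration:

```python
def collect_sites(lines):
    """Return one phrase per site from an iterable of PDB lines."""
    phrases = []
    site_id = evidence_code = None
    for line in lines:
        if not line.startswith("REMARK 800"):
            continue
        # Split "REMARK 800 KEY: value" into the key and its value.
        key, sep, value = line[10:].strip().partition(":")
        if not sep:
            continue  # bare "REMARK 800" separator line
        value = value.strip()
        if key == "SITE_IDENTIFIER":
            site_id = value
        elif key == "EVIDENCE_CODE":
            evidence_code = value
        elif key == "SITE_DESCRIPTION" and site_id and evidence_code:
            # Assumption: the residue is the last three words.
            residue = value.split()[-3:]
            phrases.append(" ".join([site_id, evidence_code] + residue))
    return phrases


sample = """REMARK 800
REMARK 800 SITE_IDENTIFIER: CC1
REMARK 800 EVIDENCE_CODE: SOFTWARE
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE EDO A 326
REMARK 800
REMARK 800 SITE_IDENTIFIER: DF8
REMARK 800 EVIDENCE_CODE: AUTHOR
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE HEM T 238
"""

for phrase in collect_sites(sample.splitlines()):
    print(phrase)
```

To run it on the real file, pass the open file object instead of `sample.splitlines()`.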
Here we assume that the wanted content is the last word of lines 2 and 3 and the last 3 words of line 4 of each four-line block, and that the file contains nothing but these blocks.
name_file = "3HDL.pdb"
output = []
with open(name_file, "r") as pdb_file:
    for linenum, line in enumerate(pdb_file):
        if linenum % 4 == 0:
            continue  # bare "REMARK 800" separator line
        elif linenum % 4 == 1:
            output.append(line.split()[-1])
        elif linenum % 4 == 2:
            output.append(line.split()[-1])
        elif linenum % 4 == 3:
            output.extend(line.split()[-3:])
# Each site contributes 5 tokens (1 + 1 + 3), so step through in groups of 5.
for i in range(0, len(output), 5):
    print(' '.join(output[i:i+5]))
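The final grouping step can also be looked at on its own. Under the assumption stated above, each site contributes five tokens (one from each of lines 2 and 3, plus three from line 4), so the flat list has to be cut into groups of five. A small sketch, where `group_tokens` is a hypothetical helper name and the token list is copied from the question's expected output:

```python
def group_tokens(tokens, size=5):
    """Cut a flat token list into space-joined phrases of `size` tokens."""
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


tokens = ["CC1", "SOFTWARE", "EDO", "A", "326",
          "DF8", "AUTHOR", "HEM", "T", "238"]
phrases = group_tokens(tokens)
for phrase in phrases:
    print(phrase)
```

Stepping the start index by `size` is what keeps the groups from overlapping.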