Comparing two words from different lines in a file using Python
I am working with a file from the Protein Data Bank which looks something like this:
SITE 2 AC1 15 ASN A 306 LEU A 309 ILE A 310 PHE A 313
SITE 3 AC1 15 ARG A 316 LEU A 326 ALA A 327 ILE A 345
SITE 4 AC1 15 CYS A 432 HIS A 435 HOH A 504
CRYST1 64.511 64.511 111.465 90.00 90.00 90.00 P 43 21 2 8
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.015501 0.000000 0.000000 0.00000
SCALE2 0.000000 0.015501 0.000000 0.00000
SCALE3 0.000000 0.000000 0.008971 0.00000
ATOM 1 N ASP A 229 29.461 51.231 44.569 1.00 47.64 N
ATOM 2 CA ASP A 229 29.341 51.990 43.290 1.00 47.13 C
ATOM 3 C ASP A 229 30.455 51.566 42.330 1.00 45.62 C
ATOM 4 O ASP A 229 31.598 51.376 42.743 1.00 47.18 O
ATOM 5 CB ASP A 229 29.433 53.493 43.567 1.00 49.27 C
ATOM 6 CG ASP A 229 28.817 54.329 42.463 1.00 51.26 C
ATOM 7 OD1 ASP A 229 27.603 54.172 42.206 1.00 53.47 O
ATOM 8 OD2 ASP A 229 29.542 55.145 41.856 1.00 52.96 O
ATOM 9 N MET A 230 30.119 51.424 41.051 1.00 41.99 N
ATOM 10 CA MET A 230 31.092 51.004 40.043 1.00 36.38 C
First I needed to extract only the fourth column of the rows labeled ATOM, which is the amino acid that the specific atom is part of. I have done that here:
import gzip

class Manual_Seq:
    def parseSeq(self, path):
        with gzip.open(path,'r') as file_content:
            for line in file_content:
                newLine = line.split(' ')[0]
                if newLine == 'ATOM':
                    AA = line[17]+line[18]+line[19]
                    print AA
Which produces this output:
ASP
ASP
ASP
.....
MET
But what I need now is to output only the first ASP, the first MET, and so on, and concatenate them so the result looks like this:
ASPMET
I was thinking I might iterate one line ahead and compare until the value differs from the first output, but I am unsure how I would do this. If you have any other ideas or any improvements to my code, please feel free to submit your suggestions. Thanks. I should also mention that the same amino acid can in fact appear more than once in a file, so the output could be "ASP MET ASP".
Instead of printing them, build a list, so
print AA
becomes
my_list.append(AA)
Just don't forget to initialize the list before the loop with my_list = [].
Now that you have all those values, you can loop through them and make a string out of the unique values. If the order doesn't matter to you, then you can use set like this:
my_string = ''.join(set(my_list))
But if the order is important, you have to loop through that list:
my_string = ''
seen = []
for item in my_list:
    if item not in seen:
        seen.append(item)
        my_string += item
You could do it without the seen list, but that would be risky: a substring test against my_string can match letters that straddle two adjacent codes (for example, GLYSER contains LYS).
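As a side note for Python 3.7 or later, where dictionaries keep insertion order, the same order-preserving deduplication can be sketched in one line with dict.fromkeys. The sample list below is made up for illustration. Like the seen loop, this keeps only the first occurrence of each code, so a later repeat of ASP would be dropped.

```python
# Made-up sample values standing in for the residue names collected above.
my_list = ['ASP', 'ASP', 'ASP', 'MET', 'MET']

# dict.fromkeys keeps one key per distinct value, in first-seen order
# (guaranteed for dicts since Python 3.7).
my_string = ''.join(dict.fromkeys(my_list))
print(my_string)
```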
Anyway, all that means you are looping twice over the same data, which is not needed. Instead of all of this, you could initialize my_string = '' and seen = [] before your main loop, and do what I did inside your loop instead of print AA. That would look like this:
def parseSeq(self, path):
    with gzip.open(path,'r') as file_content:
        my_string = ''
        seen = []
        for line in file_content:
            newLine = line.split(' ')[0]
            if newLine == 'ATOM':
                AA = line[17]+line[18]+line[19]
                if AA not in seen:
                    seen.append(AA)
                    my_string += AA
        return my_string # or print my_string
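Note that a seen list removes every later repeat, so it cannot produce the "ASP MET ASP" output mentioned in the question. A possible sketch for that case, written for Python 3, is to emit one code per residue by grouping adjacent ATOM lines that share the same chain ID, residue number, and insertion code (PDB columns 22, 23-26, and 27). The function names and the make_atom_line demo helper are hypothetical, and the sketch assumes the lines keep the fixed-width PDB layout.

```python
from itertools import groupby

def residue_key(line):
    """Return (chain ID, residue number, insertion code) from fixed PDB columns."""
    # Slices are used instead of single indexes so short lines do not raise.
    return line[21:22], line[22:26], line[26:27]

def sequence_from_atom_lines(lines):
    """Concatenate one three-letter code per residue, keeping repeats."""
    atom_lines = (line for line in lines if line.startswith('ATOM'))
    # groupby only merges adjacent lines with an equal key, so a later ASP
    # with a different residue number starts a new group.
    return ''.join(
        next(group)[17:20]
        for _key, group in groupby(atom_lines, key=residue_key)
    )

def make_atom_line(serial, atom, res_name, chain, res_seq):
    # Hypothetical demo helper: pads each field to the PDB columns used above.
    return 'ATOM  {:>5} {:<4} {:>3} {}{:>4}'.format(
        serial, atom, res_name, chain, res_seq)

demo_lines = [
    make_atom_line(1, 'N', 'ASP', 'A', 229),
    make_atom_line(2, 'CA', 'ASP', 'A', 229),
    make_atom_line(9, 'N', 'MET', 'A', 230),
    make_atom_line(17, 'N', 'ASP', 'A', 231),
]
sequence = sequence_from_atom_lines(demo_lines)
print(sequence)
```

With a real file, the lines could come from gzip.open(path, 'rt'), which yields text lines in Python 3. Keying on the residue number rather than the name also keeps two neighbouring residues of the same type (say ASP 229 followed by ASP 230) as two entries.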
I added a bit of code to your existing code:
import gzip

class Manual_Seq:
    def parseSeq(self, path):
        with gzip.open(path,'r') as file_content:

Here we define an empty list, called AAs, to hold your amino acids.

            AAs = []
            for line in file_content:

Next, I generalized your code a bit to split the line into fields so that we can extract various fields, as needed. Calling split() with no argument splits on any run of whitespace, so the padding between PDB columns does not produce empty fields.

                fields = line.split()
                line_index = fields[0] if fields else ''
                if line_index == 'ATOM':

Here we check whether the amino acid is already in the list of amino acids. If not, we add it to the list. This has the effect of deduplicating the amino acids.

                    if fields[3] not in AAs:
                        AAs.append(fields[3])

Lastly, we concatenate all the values into a single string using the empty string '' and the join() method.

            return ''.join(AAs)
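For readers on Python 3, here is a minimal sketch of the same idea. It assumes two changes: gzip.open is given text mode ('rt') so each line is a str rather than bytes, and the residue name is read from fixed columns 18-20, as in the question, instead of by whitespace splitting. The function name unique_residues and the tiny demo file are made up for illustration.

```python
import gzip
import os
import tempfile

def unique_residues(path):
    """Return each distinct residue name once, in order of first appearance."""
    names = []
    # 'rt' opens the gzip file in text mode, so each line is a str.
    with gzip.open(path, 'rt') as file_content:
        for line in file_content:
            if line.startswith('ATOM'):
                name = line[17:20]  # residue name, PDB columns 18-20
                if name not in names:
                    names.append(name)
    return ''.join(names)

# Stand-in lines: only the record name and the residue-name columns are filled.
demo_lines = ['ATOM'.ljust(17) + name for name in ('ASP', 'ASP', 'MET')]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.pdb.gz')
    with gzip.open(demo_path, 'wt') as handle:
        handle.write('\n'.join(demo_lines) + '\n')
    result = unique_residues(demo_path)

print(result)
```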
Just wondering, did you consider using BioPandas?
https://rasbt.github.io/biopandas/tutorials/Working_with_PDB_Structures_in_DataFrames/
It should be easier to do what you want using pandas. You just need to use:
df.column_name.unique()
and then concatenate the strings in the list using "".join(list_name)
https://docs.python.org/3/library/stdtypes.html#str.join