Comparing two words from different lines in a file using python

I am working with a file from the Protein Data Bank which looks something like this.

SITE     2 AC1 15 ASN A 306  LEU A 309  ILE A 310  PHE A 313                    
SITE     3 AC1 15 ARG A 316  LEU A 326  ALA A 327  ILE A 345                    
SITE     4 AC1 15 CYS A 432  HIS A 435  HOH A 504                               
CRYST1   64.511   64.511  111.465  90.00  90.00  90.00 P 43 21 2     8          
ORIGX1      1.000000  0.000000  0.000000        0.00000                         
ORIGX2      0.000000  1.000000  0.000000        0.00000                         
ORIGX3      0.000000  0.000000  1.000000        0.00000                         
SCALE1      0.015501  0.000000  0.000000        0.00000                         
SCALE2      0.000000  0.015501  0.000000        0.00000                         
SCALE3      0.000000  0.000000  0.008971        0.00000                         
ATOM      1  N   ASP A 229      29.461  51.231  44.569  1.00 47.64           N  
ATOM      2  CA  ASP A 229      29.341  51.990  43.290  1.00 47.13           C  
ATOM      3  C   ASP A 229      30.455  51.566  42.330  1.00 45.62           C  
ATOM      4  O   ASP A 229      31.598  51.376  42.743  1.00 47.18           O  
ATOM      5  CB  ASP A 229      29.433  53.493  43.567  1.00 49.27           C  
ATOM      6  CG  ASP A 229      28.817  54.329  42.463  1.00 51.26           C  
ATOM      7  OD1 ASP A 229      27.603  54.172  42.206  1.00 53.47           O  
ATOM      8  OD2 ASP A 229      29.542  55.145  41.856  1.00 52.96           O  
ATOM      9  N   MET A 230      30.119  51.424  41.051  1.00 41.99           N  
ATOM     10  CA  MET A 230      31.092  51.004  40.043  1.00 36.38           C  

First I needed to extract only the fourth column of the rows labeled ATOM, which is the amino acid that the specific atom is a part of. I have done that here.

import gzip
class Manual_Seq:

    def parseSeq(self, path):
        with gzip.open(path,'r') as file_content:
            for line in file_content:
                newLine = line.split(' ')[0]
                if newLine == 'ATOM':
                    AA = line[17]+line[18]+line[19]
                    print AA

Which produces this output:

ASP
ASP
ASP
.....
MET

But what I need now is to output only the first ASP, the first MET, and so on, and concatenate them so that it looks like this.

ASPMET

I was thinking I might iterate ahead one line and compare it until it differs from the first output, but I am unsure how I would do this. If you have any other ideas or any improvements to my code, please feel free to submit your suggestions, thanks. I also need to mention that the same amino acid can in fact appear more than once in one file, so the output could be "ASP MET ASP".

Instead of printing them, append them to a list, so

print AA

Becomes

my_list.append(AA)

Just don't forget to initialize the list before the loop with my_list = [].

Now that you have all those values, you can loop through them and make a string out of the unique values. If the order doesn't matter to you, then you can use set like this:

my_string = ''.join(set(my_list))

But if the order is important, you have to loop through that list:

my_string = ''
seen = []
for item in my_list:
    if item not in seen:
        seen.append(item)
        my_string += item
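On Python 3.7 or later, where dictionaries preserve insertion order, the same order-preserving de-duplication can probably be written more compactly with dict.fromkeys. This is only a sketch; the sample list below is made up for illustration and is not taken from a real file.

```python
# Hypothetical sample of residue names, as the parsing loop would collect them.
my_list = ['ASP', 'ASP', 'ASP', 'MET', 'MET', 'ASP']

# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed from Python 3.7 on).
my_string = ''.join(dict.fromkeys(my_list))
print(my_string)  # ASPMET
```

Like the seen list, this drops the later ASP, so it only suits the case where each amino acid should appear once.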

You could do it without the seen list (for example, by testing item not in my_string), but that would be risky: a substring test can match across the boundary between two adjacent codes.

Anyway, all of that means you loop twice over the same data, which is not needed. Instead, you could initialize my_string = '' and seen = [] before your main loop, and do what I did inside your loop in place of print AA. That would look like this:

def parseSeq(self, path):
    with gzip.open(path,'r') as file_content:
        my_string = ''
        seen = []
        for line in file_content:
            newLine = line.split(' ')[0]
            if newLine == 'ATOM':
                AA = line[17]+line[18]+line[19]
                if AA not in seen:
                    seen.append(AA)
                    my_string += AA
        return my_string # or print my_string
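Note that a seen list removes every later repeat, whereas the question says the same amino acid can legitimately occur again ("ASP MET ASP"). One possible way to handle that, sketched here under the assumption that the file follows the standard fixed-column PDB layout, is to emit a name whenever the chain ID and residue number change from the previous ATOM line. The names residue_sequence and parse_seq are hypothetical, the code is written for Python 3 (hence the 'rt' text mode), and the last sample line is invented to show a repeat.

```python
import gzip

def residue_sequence(lines):
    """Join the residue names of consecutive residues found in ATOM lines."""
    names = []
    previous_key = None
    for line in lines:
        if not line.startswith('ATOM'):
            continue
        # Fixed PDB columns: chain ID (column 22), residue number plus
        # insertion code (columns 23-27), residue name (columns 18-20).
        key = (line[21:22], line[22:27].strip())
        if key != previous_key:
            names.append(line[17:20])
            previous_key = key
    return ''.join(names)

def parse_seq(path):
    """Read a gzipped PDB file in text mode and return its residue names."""
    with gzip.open(path, 'rt') as file_content:
        return residue_sequence(file_content)

sample = [
    "ATOM      1  N   ASP A 229",
    "ATOM      2  CA  ASP A 229",
    "ATOM      9  N   MET A 230",
    "ATOM     10  CA  MET A 230",
    "ATOM     11  N   ASP A 231",  # hypothetical extra residue
]
print(residue_sequence(sample))  # ASPMETASP
```

Keying on the residue number instead of the name also keeps two neighbouring residues of the same type apart, which a plain "different from the previous name" check would merge.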

I added a bit of code to your existing code:

import gzip
class Manual_Seq:

    def parseSeq(self, path):
        with gzip.open(path,'r') as file_content:

Here we define an empty list, called AAs, to hold your amino acids.

            AAs = []
            for line in file_content:

Next, I generalized your code a bit to split the line into fields so that we can extract various fields, as needed. Calling split() with no argument splits on runs of whitespace, so the aligned columns do not produce empty fields.

                fields = line.split()
                if fields and fields[0] == 'ATOM':

Here we check to see if the amino acid is already in the list of amino acids. If not, we add it to the list. This has the effect of deduplicating the amino acids.

                    if fields[3] not in AAs:
                        AAs.append(fields[3])

Lastly, we concatenate all the values into a single string using the empty string '' and the join() method.

        return ''.join(AAs)
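One caveat with splitting lines into fields: PDB is a fixed-column format, so neighbouring fields can touch and shift the field positions. The sketch below compares whitespace splitting with slicing columns 18-20, as the question's code does. The helper residue_name and the second sample line are hypothetical and only serve to illustrate the point.

```python
def residue_name(line):
    """Return the residue name from fixed columns 18-20 of an ATOM line."""
    return line[17:20]

# A line from the question: whitespace splitting and slicing agree.
plain = "ATOM      1  N   ASP A 229      29.461  51.231  44.569  1.00 47.64           N  "
print(plain.split()[3], residue_name(plain))  # ASP ASP

# Hypothetical line with a four-character atom name and an alternate-location
# flag 'A' in column 17: the fields touch, so split() merges them.
packed = "ATOM    100 HD11ALEU A 309"
print(packed.split()[3], residue_name(packed))  # A LEU
```

For files like the excerpt in the question, either approach gives the same result; slicing is simply the more defensive choice.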

Just wondering, did you consider using BioPandas?

https://rasbt.github.io/biopandas/tutorials/Working_with_PDB_Structures_in_DataFrames/

It should be easier to do what you want using pandas. You just need to use:

df.column_name.unique()

and then concatenate the strings in the list using "".join(list_name) (https://docs.python.org/3/library/stdtypes.html#str.join).
