
Finding common elements between two files

I have two different files, as shown below. file1.txt is tab-delimited:

AT5G54940.1 3182
            pfam
            PF01253 SUI1#Translation initiation factor SUI1
            mf
            GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            bp
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   4996
                pfam
                PF01575 MaoC_dehydratas#MaoC like domain
                mf
                GO:0016491  oxidoreductase activity
                GO:0033989  3alpha,7alpha,
OS08T0174000-01 560919

and file2.txt, which contains different protein names:

GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01

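I need to run a program that finds the protein names from file2 in file1 and also prints all the "GO:" lines associated with each protein, where there are any. The hard part for me is parsing the first file; its format is odd. I tried something like the following, but other approaches would be much appreciated: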

import re

with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:
    with open('a.txt', 'w') as output:
        for line in mydict:
            new_list = line.strip().split()
            protein = new_list[0]
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)

Desired output (any format is fine, as long as I get all the corresponding GO terms):

AT5G54940.1 GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   GO:0016491  oxidoreductase activity
                    GO:0033989  3alpha,7alpha,
OS08T0174000-01

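Just to give you an idea of how to tackle this: in the input file, the "group" belonging to one protein is delimited by a change from indented lines to a non-indented line. Look for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name. All other lines may be GO: lines.

You can detect indentation with if line.startswith(" "). Depending on the format of your input file, you may need to look for "\t" instead of " ".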

def get_protein_chunks(filepath):
    """Yield one list of stripped lines per protein group."""
    chunk = []
    with open(filepath) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # A non-indented line (no leading space or tab) starts a new group.
            if chunk and not line[0].isspace():
                yield chunk
                chunk = []
            chunk.append(line.strip())
    if chunk:
        yield chunk  # the last group is not followed by another protein line


with open('file2.txt') as f:
    look_for_proteins = set(line.strip() for line in f)

for p in get_protein_chunks('file1.txt'):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)

Here, a chunk is nothing more than a list of stripped lines. I use a generator to extract the protein chunks from the input file. As you can see, the logic depends only on indentation: a non-indented line starts a new chunk, which also covers a protein that has no indented lines at all.

With the generator in place, you can process the data however you like. I just print it. However, you may want to put the data into a dictionary and analyse it further.

Output:

$ python test.py
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: OS08T0174000-01
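
For the further analysis mentioned above, it may help to split each chunk into its sections first. The following is only a rough sketch: chunk_to_record is a hypothetical helper, and it assumes that a line holding a single token (such as pfam, mf or bp) is a section label and that every other line is an identifier followed by a description. Neither assumption is confirmed beyond the sample shown in the question.

```python
def chunk_to_record(chunk):
    """Turn one chunk (a list of stripped lines) into (name, sections).

    `sections` maps a section label such as 'mf' or 'bp' to a list of
    (identifier, description) tuples.
    """
    name = chunk[0].split()[0]
    sections = {}
    label = None  # current section label; None until the first one is seen
    for line in chunk[1:]:
        parts = line.split(None, 1)
        if not parts:
            continue  # ignore empty lines
        if len(parts) == 1:
            # Assumption: a single-token line is a section label.
            label = parts[0]
            sections.setdefault(label, [])
        else:
            # Entry line: identifier first, the rest is the description.
            sections.setdefault(label, []).append((parts[0], parts[1]))
    return name, sections


# A chunk shaped like the ones yielded by the generator above.
chunk = [
    "AT5G54940.1 3182",
    "pfam",
    "PF01253 SUI1#Translation initiation factor SUI1",
    "mf",
    "GO:0003743  translation initiation factor activity",
    "bp",
    "GO:0006413  translational initiation",
    "GO:0006412  translation",
]
name, sections = chunk_to_record(chunk)
print(name)
print(sections["bp"])
```

Keeping the identifier and the description apart makes it straightforward to count or compare GO identifiers across proteins later on.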

One option is to build a dictionary of lists, using the protein name as the key:

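#!/usr/bin/env python

import pprint
pp = pprint.PrettyPrinter()

with open('file2.txt') as f:
    proteins = set(line.strip() for line in f)

d = {}
key = None  # protein whose group is being read, or None if it is not wanted

with open('file1.txt') as infile:
    for line in infile:
        parts = line.split()
        if not parts:
            continue  # skip blank lines

        if not line[0].isspace():
            # Non-indented line: a new protein group starts here.
            key = parts[0] if parts[0] in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and parts[0].startswith('GO:'):
            # GO line that belongs to a wanted protein.
            d[key].append(line.strip())

pp.pprint(d)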

I used the pprint module to print the dictionary since, as you said, you are not too concerned about the format. The actual output is:

{'AT5G54940.1': ['GO:0003743  translation initiation factor activity',
                 'GO:0008135  translation factor activity, nucleic acid binding',
                 'GO:0006413  translational initiation',
                 'GO:0006412  translation',
                 'GO:0044260  cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity',
                       'GO:0033989  3alpha,7alpha,'],
 'OS08T0174000-01': []}

Edit

Instead of pprint, you can use a loop to get the output specified in the question:

with open('out.txt', 'w') as out:
    # sorted() makes the order deterministic; items() works on Python 2 and 3.
    for k, v in sorted(d.items()):
        out.write('Protein: {}\n'.format(k))
        for go_line in v:
            out.write('{}\n'.format(go_line))

out.txt

Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: OS08T0174000-01
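
The question asks for the protein name next to its GO terms and also lists a protein that has no GO terms at all. One possible layout, shown here only as a sketch, repeats the name on every line and separates it from the GO line with a tab. format_rows and the order list are hypothetical names, and the dictionary is assumed to have the same shape as d above (protein name mapped to a list of GO lines).

```python
def format_rows(go_by_protein, order):
    """Return output lines: name, a tab, then one GO line per row.

    A protein with no GO terms yields a single row holding only its name.
    """
    rows = []
    for name in order:
        if name not in go_by_protein:
            continue  # listed in `order` but absent from the parsed data
        go_lines = go_by_protein[name]
        if not go_lines:
            rows.append(name)
        for go_line in go_lines:
            rows.append(name + "\t" + go_line)
    return rows


# Small stand-in for the dictionary built from file1.txt.
d = {
    "AT5G54940.1": [
        "GO:0003743  translation initiation factor activity",
        "GO:0008135  translation factor activity, nucleic acid binding",
    ],
    "OS08T0174000-01": [],
}
# Names in the order they should appear, e.g. the lines of file2.txt.
order = ["AT5G54940.1", "OS05T0566300-01", "OS08T0174000-01"]

rows = format_rows(d, order)
for row in rows:
    print(row)
```

Repeating the name on each row costs a little space, but every line can then be filtered or sorted on its own with ordinary text tools.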

