Finding common elements between two files

Question

I have two different files, shown below. file1.txt is tab-delimited:

AT5G54940.1 3182
            pfam
            PF01253 SUI1#Translation initiation factor SUI1
            mf
            GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            bp
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   4996
                pfam
                PF01575 MaoC_dehydratas#MaoC like domain
                mf
                GO:0016491  oxidoreductase activity
                GO:0033989  3alpha,7alpha,
OS08T0174000-01 560919

and file2.txt, which contains various protein names:

GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01

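I need a program that finds the protein names from file2 in file1 and also prints all the "GO:" entries associated with each protein, where there are any. The hard part for me is parsing the first file; the format is odd. I tried something like the following, but other approaches would be much appreciated.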

import re
with open('file2.txt') as mylist:
    proteins = set(line.strip() for line in mylist)

with open('file1.txt') as mydict:                           
    with open('a.txt', 'w') as output:                  
        for line in mydict:                                 
            new_list = line.strip().split()                         
            protein = new_list[0]                               
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)

Desired output (any format is fine, as long as I get all the corresponding GO entries):

AT5G54940.1 GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   GO:0016491  oxidoreductase activity
                    GO:0033989  3alpha,7alpha,
OS08T0174000-01

Answer 1

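Just to give you an idea of how to approach this: in your input file, the "group" belonging to one protein is delimited by a change from indented lines to a non-indented line. Look for that transition and you have your groups (or "chunks"). The first line of a group contains the protein name. All other lines may be GO: lines.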

You can detect indentation with if line.startswith(" ") (depending on the format of your input file, you may need to look for "\t" instead of " ").

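def get_protein_chunks(filepath):
    """Yield one list of stripped lines per protein block."""
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # Accept both space and tab indentation.
            current_indented = line.startswith((" ", "\t"))
            # An indented -> non-indented transition starts a new block.
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
    # No transition follows the last block, so emit it explicitly.
    if chunk:
        yield chunk


with open('file2.txt') as f:
    look_for_proteins = set(line.strip() for line in f)


for p in get_protein_chunks('file1.txt'):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)

Here, a chunk is nothing but a list of stripped lines. I use a generator to extract the protein chunks from the input file. As you can see, the logic is based solely on the transition from indented lines to a non-indented line.

With the generator, you can process the data however you like. I simply print it. However, you may want to put the data into a dictionary and analyze it further.

Output:

$ python test.py
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: OS08T0174000-01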
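
The suggestion to put the data into a dictionary could look roughly like the sketch below. It is only an illustration, not part of the original answer: the names parse_go_terms and select_proteins are hypothetical, and it assumes that header lines start in column 0, that all lines belonging to a protein are indented with spaces or tabs, and that GO lines begin with "GO:" once stripped. The sample text is a shortened, made-up excerpt in the same layout as file1.txt.

```python
import io


def parse_go_terms(lines):
    """Map each protein name to the list of its GO lines.

    Assumes a header line starts in column 0 and every line that
    belongs to the protein is indented with spaces or tabs.
    """
    go_by_protein = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if not raw[0].isspace():
            # Non-indented line: the first field is the protein name.
            current = line.split()[0]
            go_by_protein[current] = []
        elif current is not None and line.startswith("GO:"):
            go_by_protein[current].append(line)
    return go_by_protein


def select_proteins(go_by_protein, wanted):
    """Keep only the proteins whose names appear in `wanted`."""
    return {name: terms for name, terms in go_by_protein.items()
            if name in wanted}


# Shortened sample in the same layout as file1.txt (illustrative only).
sample = io.StringIO(
    "AT5G54940.1\t3182\n"
    "\tpfam\n"
    "\tPF01253\tSUI1#Translation initiation factor SUI1\n"
    "\tmf\n"
    "\tGO:0003743\ttranslation initiation factor activity\n"
    "OS08T0174000-01\t560919\n"
)
wanted = {"AT5G54940.1", "OS05T0566300-01"}
result = select_proteins(parse_go_terms(sample), wanted)
print(result)
```

Parsing everything first and filtering afterwards keeps the two concerns separate, so the same dictionary can be reused for other protein lists without re-reading the file.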

Answer 2

One option is to build a dictionary of lists, using the protein name as the key:

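#!/usr/bin/env python

import pprint
pp = pprint.PrettyPrinter()

with open('file2.txt') as f:
    proteins = set(line.strip() for line in f)

d = {}
key = None  # protein whose GO lines are currently being collected

with open('file1.txt') as infile:
    for raw in infile:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if not raw[0].isspace():
            # Non-indented line: start of a new protein block.
            name = line.split()[0]
            # Reset key so GO lines of unwanted proteins are not
            # appended to the previously matched protein.
            key = name if name in proteins else None
            if key is not None:
                d[key] = []
        elif key is not None and line.startswith('GO:'):
            d[key].append(line)

pp.pprint(d)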

Since you said you are not too concerned about the format, I used the pprint module to print the dictionary. The actual output is:

{'AT5G54940.1': ['GO:0003743  translation initiation factor activity',
                 'GO:0008135  translation factor activity, nucleic acid binding',
                 'GO:0006413  translational initiation',
                 'GO:0006412  translation',
                 'GO:0044260  cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity',
                       'GO:0033989  3alpha,7alpha,'],
 'OS08T0174000-01': []}

Edit

Instead of using pprint, you can also use a loop to get the output specified in the question:

with open('out.txt', 'w') as out:
    for k, v in d.items():
        out.write('Protein: {}\n'.format(k))
        for go_line in v:
            out.write('{}\n'.format(go_line))

out.txt (dictionaries keep insertion order in Python 3.7+, so the proteins appear in file1.txt order):

Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: OS08T0174000-01
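
If the layout from the question is preferred (protein name on the same line as its first GO entry, remaining entries indented), a loop along the following lines might work. This is a sketch under assumptions rather than part of the original answer: format_grouped is a hypothetical helper, it expects a dictionary shaped like d above, and it uses a single tab for indentation instead of aligning columns.

```python
def format_grouped(go_by_protein):
    """Return output lines: name plus first GO term, then indented terms."""
    lines = []
    for name, terms in go_by_protein.items():
        if not terms:
            # Protein was found but has no GO terms: print the name alone.
            lines.append(name)
            continue
        lines.append(name + "\t" + terms[0])
        for term in terms[1:]:
            lines.append("\t" + term)
    return lines


# Small hand-written dictionary in the same shape as `d` (illustrative only).
example = {
    "AT5G54940.1": [
        "GO:0003743  translation initiation factor activity",
        "GO:0006413  translational initiation",
    ],
    "OS08T0174000-01": [],
}
for out_line in format_grouped(example):
    print(out_line)
```

Returning a list of lines instead of writing directly makes it easy to send the same result to a file with '\n'.join(...) or to the terminal.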

Answer 1
2 accepted 2014-06-26 09:47:07

Answer 2
1 2014-06-26 09:52:21
