简体   繁体   中英

Finding common elements between two files

I have two different files as follows: file1.txt is tab-delimited

AT5G54940.1 3182
            pfam
            PF01253 SUI1#Translation initiation factor SUI1
            mf
            GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            bp
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   4996
                pfam
                PF01575 MaoC_dehydratas#MaoC like domain
                mf
                GO:0016491  oxidoreductase activity
                GO:0033989  3alpha,7alpha,
OS08T0174000-01 560919

and file2.txt that contains different protein names,

GRMZM2G158629_P02
AT5G54940.1
OS05T0566300-01
OS08T0174000-01

I need to run a program, that finds me proteins names that are present in file2 from file1 but also prints me all "GO:" that appertains to that protein, if applicable. The difficult part for me is parsing the 1st file..the format is strange. I tried something like this,but any other ways are very much appreciated,

import re
with open('file2.txt') as mylist:                                                      
proteins = set(line.strip() for line in mylist)                         

with open('file1.txt') as mydict:                           
    with open('a.txt', 'w') as output:                  
        for line in mydict:                                 
            new_list = line.strip().split()                         
            protein = new_list[0]                               
            if protein in proteins:
                if re.search(r'GO:\d+', line):
                    output.write(protein+'\t'+line)

Desired output,whichever format is OK as long as I have all corresponding GO's

AT5G54940.1 GO:0003743  translation initiation factor activity
            GO:0008135  translation factor activity, nucleic acid binding
            GO:0006413  translational initiation
            GO:0006412  translation
            GO:0044260  cellular macromolecule metabolic process
GRMZM2G158629_P02   GO:0016491  oxidoreductase activity
                    GO:0033989  3alpha,7alpha,
OS08T0174000-01

Just to give you an idea how you might want to tackle this. A "group" belonging to one protein in your input file is delimited by a change from indented lines to a non-indented one. Search for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name. All other lines might be GO: lines.

You can detect indention by using if line.startswith(" ") (instead of " " you might look for "\\t" , depending on your input file format).

def get_protein_chunks(filepath):
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            if not line.startswith(" "):
                current_indented = False
            else:
                current_indented = True
            if last_indented and not current_indented:
                yield chunk
                chunk = []       
            chunk.append(line.strip())
            last_indented = current_indented


look_for_proteins = set(line.strip() for line in open('file2.txt'))


for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print "Protein: %s" % proteinname
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print g

Here, a chunk is nothing but a list of stripped lines. I extract the protein chunks from the input file with a generator. As you can see, the logic is based only on the transition from indented line to non-indented line.

When using the generator you can do with the data whatever you want to. I simply printed it. However, you might want to put the data into a dictionary and do further analysis.

Output:

$ python test.py 
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,

One option would be to build up a dictionary of lists, using the name of the protein as the key:

#!/usr/bin/env python

import pprint
pp = pprint.PrettyPrinter()

proteins = set(line.strip() for line in open('file2.txt'))
d = {}

with open('file1.txt') as file:
    for line in file:
        line = line.strip()
        parts = line.split()

        if parts[0] in proteins:
            key = parts[0]            
            d[key] = []                            
        elif parts[0].split(':')[0] == 'GO':
            d[key].append(line)

pp.pprint(d)

I've used the pprint module to print the dictionary, as you said you weren't too fussy about the format. The output as it stands is:

{'AT5G54940.1': ['GO:0003743  translation initiation factor activity',
                 'GO:0008135  translation factor activity, nucleic acid binding',
                 'GO:0006413  translational initiation',
                 'GO:0006412  translation',
                 'GO:0044260  cellular macromolecule metabolic process'],
 'GRMZM2G158629_P02': ['GO:0016491  oxidoreductase activity',
                       'GO:0033989  3alpha,7alpha,']}

edit

Instead of using pprint , you could obtain the output specified in the question using a loop:

with open('out.txt', 'w') as out:    
    for k,v in d.iteritems():        
        out.write('Protein: {}\n'.format(k))
        out.write('{}\n'.format('\n'.join(v)))

out.txt :

Protein: GRMZM2G158629_P02
GO:0016491  oxidoreductase activity
GO:0033989  3alpha,7alpha,
Protein: AT5G54940.1
GO:0003743  translation initiation factor activity
GO:0008135  translation factor activity, nucleic acid binding
GO:0006413  translational initiation
GO:0006412  translation
GO:0044260  cellular macromolecule metabolic process

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM