
Python: combine each pair of lines and make the script more efficient

I'm sorry for how basic this question is.

The aim: This is my output from a software program:

1   590 SC  1.000   LEU2_YEAST  100%
1   590 EC  1.000   LEU2_ECOLI  100%
2   467 SC  1.000   FADH_YEAST  100%
2   467 EC  1.000   ADH3_ECOLI  100%
3   463 SC  1.000   6PG1_YEAST  100%
3   463 SC  0.816   6PG2_YEAST
3   463 EC  1.000   6PGD_ECOLI  100%
3   463 EC  0.903   6PG9_ECOLI
4   446 SC  1.000   YME1_YEAST  59%
4   446 EC  1.000   FTSH_ECOLI  100%
5   411 SC  1.000   ADH4_YEAST  100%
5   411 EC  1.000   ADH2_ECOLI  99%
8   256 SC  1.000   ATM1_YEAST  100%
8   256 EC  1.000   HLYB_ECOLI  99%
8   256 EC  0.987   HLY2_ECOLI
9   252 SC  1.000   MDL2_YEAST  100%
9   252 SC  0.203   MDL1_YEAST
9   252 EC  1.000   MSBA_ECOLI  99%

For those with a biology background: I want to pull out ONLY the reciprocal best hits. For those without a biology background: I want to extract the pairs of genes, but only when the number in the first column appears exactly twice.

For example, we can see the number 1 appears twice in the first column of the file:

 1  590 SC  1.000   LEU2_YEAST  100%
 1  590 EC  1.000   LEU2_ECOLI  100%

but the number 3 appears four times in the first column of the file:

3   463 SC  1.000   6PG1_YEAST  100%
3   463 SC  0.816   6PG2_YEAST
3   463 EC  1.000   6PGD_ECOLI  100%
3   463 EC  0.903   6PG9_ECOLI

So for this sample file, the output would look like this:

LEU2_YEAST LEU2_ECOLI
FADH_YEAST ADH3_ECOLI
YME1_YEAST FTSH_ECOLI
ADH4_YEAST ADH2_ECOLI

These are the only four groups in the file that consist of exactly two lines.

This is my code:

import sys
Dict1 = {}
for line in open(sys.argv[1]):
    line = line.strip().split()
    if line[0] not in Dict1.keys():
        Dict1[line[0]] = [line[4]] 
    elif line[0] in Dict1.keys():
        Dict1[line[0]].append(line[4])

for i in Dict1.values():
    if len(i) == 2:
        print i[0] + "\t" + i[1] 

This works; the output it prints is:

LEU2_YEAST  LEU2_ECOLI
FADH_YEAST  ADH3_ECOLI
ADH4_YEAST  ADH2_ECOLI
YME1_YEAST  FTSH_ECOLI

I'm curious how other people would do this. My actual data set will have thousands of lines, so I'm wondering whether there's a more efficient way (in terms of either time or memory). I'd also like to know how people would add "checks" to make sure the number appears only twice. I have mastered the Python basics, so I'm now looking into ways to design code better.

A possible improvement is to change if line[0] not in Dict1.keys() to if line[0] not in Dict1. In Python 2, Dict1.keys() builds a new list on every call, so not in Dict1.keys() is an O(n) operation, whereas not in Dict1 is a hash lookup and about O(1).
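As a minimal sketch of that change, here is the asker's loop with the membership test done directly on the dictionary. The function name pair_names, the variable names, and the order list (which keeps the output in file order) are my own additions for illustration, not part of the original script.

```python
def pair_names(lines):
    """Return [first, second] name pairs for ids that occur exactly twice."""
    names_by_id = {}
    order = []  # ids in first-seen order, so the output follows the file
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] not in names_by_id:  # direct dict lookup, no .keys()
            names_by_id[fields[0]] = []
            order.append(fields[0])
        names_by_id[fields[0]].append(fields[4])
    # keep only the ids that were seen on exactly two lines
    return [names_by_id[k] for k in order if len(names_by_id[k]) == 2]
```

It accepts any iterable of lines, so an open file object can be passed in directly.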

I'm not sure how large the real performance gain is. You should time both versions to find out.
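One rough way to measure the difference is the standard timeit module. This is only a sketch: the dictionary size and the key are made-up test values, and list(d) is used to emulate Python 2's keys(), since on Python 3 keys() returns a view whose membership test is also a hash lookup.

```python
import timeit

# Hypothetical test data: 10,000 string keys, like the ids in the first column.
d = {str(i): [] for i in range(10000)}
key = "9999"

def in_dict():
    # hash lookup directly on the dictionary
    return key in d

def in_key_list():
    # emulates Python 2's d.keys(): builds a full list, then scans it
    return key in list(d)

t_dict = timeit.timeit(in_dict, number=1000)
t_list = timeit.timeit(in_key_list, number=1000)
print("in dict:     %.6f s" % t_dict)
print("in key list: %.6f s" % t_list)
```

The absolute numbers depend on the machine, so compare the two timings against each other, not against a fixed threshold.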

If your file is sorted by the numbers in the first column, you can use itertools.groupby:

import sys
from itertools import groupby
import operator

with open(sys.argv[1]) as infile:
    # split lines and group them by the number in the first column
    groups = groupby([line.strip().split() for line in infile], operator.itemgetter(0))
# convert groups to lists and discard keys
groups = [list(lines) for _, lines in groups]
# discard groups that don't have 2 items and format the output
groups = ['%s\t%s' % (lines[0][4], lines[1][4]) for lines in groups if len(lines) == 2]
# alternatively you can use
#   groups = ['\t'.join(list(zip(*lines))[4]) for lines in groups if len(lines) == 2]

print('\n'.join(groups))
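If memory matters, a possible variant of the groupby approach above is to stream the file instead of building lists of every line. The sketch below is my own illustration, not the answerer's code: the name reciprocal_pairs and the seen set are assumptions added here. The set doubles as one kind of "check", raising an error if an id shows up again after its group has ended, which would mean the file is not grouped by the first column.

```python
from itertools import groupby

def reciprocal_pairs(lines):
    """Yield (first, second) names for each id that spans exactly two lines."""
    seen = set()  # ids whose group has already been processed
    # lazily split non-blank lines; nothing is stored beyond the current group
    rows = (line.split() for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        if key in seen:
            raise ValueError("id %r appears in non-adjacent lines" % key)
        seen.add(key)
        names = [row[4] for row in group]
        if len(names) == 2:
            yield names[0], names[1]

# Small in-memory demo; an open file object can be passed the same way.
demo = [
    "1   590 SC  1.000   LEU2_YEAST  100%",
    "1   590 EC  1.000   LEU2_ECOLI  100%",
    "8   256 SC  1.000   ATM1_YEAST  100%",
    "8   256 EC  1.000   HLYB_ECOLI  99%",
    "8   256 EC  0.987   HLY2_ECOLI",
]
for first, second in reciprocal_pairs(demo):
    print(first + "\t" + second)
```

Only the current group and the set of ids are held in memory, so the cost grows with the number of distinct ids and not with the number of lines.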
