I'm sorry for how basic this question is.
The aim: This is my output from a software program:
1 590 SC 1.000 LEU2_YEAST 100%
1 590 EC 1.000 LEU2_ECOLI 100%
2 467 SC 1.000 FADH_YEAST 100%
2 467 EC 1.000 ADH3_ECOLI 100%
3 463 SC 1.000 6PG1_YEAST 100%
3 463 SC 0.816 6PG2_YEAST
3 463 EC 1.000 6PGD_ECOLI 100%
3 463 EC 0.903 6PG9_ECOLI
4 446 SC 1.000 YME1_YEAST 59%
4 446 EC 1.000 FTSH_ECOLI 100%
5 411 SC 1.000 ADH4_YEAST 100%
5 411 EC 1.000 ADH2_ECOLI 99%
8 256 SC 1.000 ATM1_YEAST 100%
8 256 EC 1.000 HLYB_ECOLI 99%
8 256 EC 0.987 HLY2_ECOLI
9 252 SC 1.000 MDL2_YEAST 100%
9 252 SC 0.203 MDL1_YEAST
9 252 EC 1.000 MSBA_ECOLI 99%
For those with a biology background, I want to pull out ONLY the reciprocal best hits. For those with a non-biology background, I want to extract the pairs of genes (fifth column), but only where the number in the first column appears exactly twice.
For example, we can see the number 1 appears twice in the first column of the file:
1 590 SC 1.000 LEU2_YEAST 100%
1 590 EC 1.000 LEU2_ECOLI 100%
but the number 3 appears four times in the first column of the file:
3 463 SC 1.000 6PG1_YEAST 100%
3 463 SC 0.816 6PG2_YEAST
3 463 EC 1.000 6PGD_ECOLI 100%
3 463 EC 0.903 6PG9_ECOLI
So for this sample file, the output would look like this:
LEU2_YEAST LEU2_ECOLI
FADH_YEAST ADH3_ECOLI
YME1_YEAST FTSH_ECOLI
ADH4_YEAST ADH2_ECOLI
These are the only four groups in the file that consist of exactly two lines.
This is my code:
import sys
Dict1 = {}
for line in open(sys.argv[1]):
    line = line.strip().split()
    if line[0] not in Dict1.keys():
        Dict1[line[0]] = [line[4]]
    elif line[0] in Dict1.keys():
        Dict1[line[0]].append(line[4])
for i in Dict1.values():
    if len(i) == 2:
        print i[0] + "\t" + i[1]
This works; the output it prints is:
LEU2_YEAST LEU2_ECOLI
FADH_YEAST ADH3_ECOLI
ADH4_YEAST ADH2_ECOLI
YME1_YEAST FTSH_ECOLI
I'm just curious how other people would do it. In reality, my actual data set will have thousands of lines, so I'm wondering whether there is a more efficient way (in terms of either time or memory) of doing this, or how people would add "checks" to make sure the number appears only twice. At this stage I have mastered the Python basics, so I'm looking into ways to design code better.
A possible improvement is to change the if line[0] not in Dict1.keys() to if line[0] not in Dict1, since in Python 2 (which this code uses) Dict1.keys() builds a list, so not in Dict1.keys() is an O(n) operation, whereas not in Dict1 is about O(1).
I'm not sure how large the real performance gain is. You should time both versions to find out.
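One way to measure the difference is the standard-library timeit module. The sketch below is written for Python 3, which is an assumption, since the code in the question uses Python 2 print statements. In Python 3, keys() returns a view whose membership test is already fast, so list(d) stands in here for the list that Python 2's keys() builds. The dictionary size and the repeat count are arbitrary illustrative values, not taken from the question.

```python
import timeit

# Hypothetical test data: 100,000 string keys, similar in shape to the
# group numbers in the first column of the file.
d = {str(i): [] for i in range(100000)}
missing = "-1"  # worst case for a list scan: the key is absent

def dict_lookup():
    # Hash-based membership test on the dict itself, roughly O(1).
    return missing not in d

def list_lookup():
    # Emulates Python 2's d.keys(): build a list of keys, then scan it, O(n).
    return missing not in list(d)

t_dict = timeit.timeit(dict_lookup, number=100)
t_list = timeit.timeit(list_lookup, number=100)
print("dict membership: %.6f s" % t_dict)
print("list membership: %.6f s" % t_list)
```

The absolute numbers depend on the machine; only the ratio between the two timings is of interest.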
If your file is sorted by the numbers in the first column, you can use itertools.groupby:
import sys
from itertools import groupby
import operator

with open(sys.argv[1]) as infile:
    # split lines and group them by the number in the first column
    groups = groupby([line.strip().split() for line in infile], operator.itemgetter(0))
    # convert groups to lists and discard keys
    groups = [list(lines) for _, lines in groups]
# discard groups that don't have 2 items and format the output
groups = ['%s\t%s' % (lines[0][4], lines[1][4]) for lines in groups if len(lines) == 2]
# alternatively you can use (Python 2 only, where zip returns a list)
# groups = ['\t'.join(zip(*lines)[4]) for lines in groups if len(lines) == 2]
print '\n'.join(groups)
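As for the "checks" mentioned in the question, one option is to wrap the grouping in a function that also verifies that each two-line group holds one SC line and one EC line. This is a minimal sketch for Python 3 rather than a definitive implementation: the function name reciprocal_pairs and the SC/EC check are assumptions based on the sample data, and it uses collections.defaultdict so the input does not need to be sorted.

```python
from collections import defaultdict

def reciprocal_pairs(lines):
    """Return (SC gene, EC gene) tuples for groups with exactly two lines."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # key: group number (column 1); value: (species code, gene name)
        groups[fields[0]].append((fields[2], fields[4]))

    pairs = []
    for members in groups.values():
        if len(members) != 2:
            continue  # the group number must appear exactly twice
        by_species = dict(members)
        if set(by_species) != {"SC", "EC"}:
            continue  # assumed check: one line per species
        pairs.append((by_species["SC"], by_species["EC"]))
    return pairs

# The sample data from the question.
sample = [
    "1 590 SC 1.000 LEU2_YEAST 100%",
    "1 590 EC 1.000 LEU2_ECOLI 100%",
    "2 467 SC 1.000 FADH_YEAST 100%",
    "2 467 EC 1.000 ADH3_ECOLI 100%",
    "3 463 SC 1.000 6PG1_YEAST 100%",
    "3 463 SC 0.816 6PG2_YEAST",
    "3 463 EC 1.000 6PGD_ECOLI 100%",
    "3 463 EC 0.903 6PG9_ECOLI",
    "4 446 SC 1.000 YME1_YEAST 59%",
    "4 446 EC 1.000 FTSH_ECOLI 100%",
    "5 411 SC 1.000 ADH4_YEAST 100%",
    "5 411 EC 1.000 ADH2_ECOLI 99%",
    "8 256 SC 1.000 ATM1_YEAST 100%",
    "8 256 EC 1.000 HLYB_ECOLI 99%",
    "8 256 EC 0.987 HLY2_ECOLI",
    "9 252 SC 1.000 MDL2_YEAST 100%",
    "9 252 SC 0.203 MDL1_YEAST",
    "9 252 EC 1.000 MSBA_ECOLI 99%",
]

for sc, ec in reciprocal_pairs(sample):
    print(sc + "\t" + ec)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, which reads the file one line at a time. Only the group number, species code and gene name of each line are kept in memory.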