I have a file with AA sequences in column 1 and, in column 2, the number of times each sequence appears, which I created using Counter(). Column 3 holds numerical values, which are all different. The entries in col 1 and col 2 can repeat across lines.
Ex. Input file:
ADVAEDY 28 0.17805
ADVAEDY 28 0.17365
ADVAEDY 28 0.16951
...
ARYLGYNSNWYPFDY 23 4.16148
ARYLGYNSNWYPFDY 23 3.17716
ARYLGYNSNWYPFDY 23 1.74919
...
ARHLGYNSAWYPFDY 21 10.6038
ARHLGYNSAWYPFDY 21 2.3498
ARHLGYNSAWYPFDY 21 1.68818
...
AGIAFDY 20 0.457553
AGIAFDY 20 0.416321
AGIAFDY 20 0.286349
...
ATIEDH 4 2.45283
ATIEDH 4 0.553351
ATIEDH 4 0.441266
There are 197 lines in this file, but only 48 unique AA sequences in col 1. The code that generated this file:
import sys
from collections import Counter

input_fh = sys.argv[1]  # File containing all CDR(x)
cdr_spec = sys.argv[2]  # File containing CDR(x) in one column and specificities in the second

with open(input_fh, "r") as f1:
    cdr = [line.strip() for line in f1]

with open(cdr_spec, "r") as f2:
    cdr_spec_list = [line.strip().split() for line in f2]

# Note: c is not defined in this excerpt
cdr_spec_out = open("CDR" + c + "_counts_spec.txt", "w")

counter_cdr = Counter(cdr)
countermc_cdr = counter_cdr.most_common()
print len(countermc_cdr)

# This one might work:
for k, v in countermc_cdr:
    for x, y in cdr_spec_list:
        if k == x:
            print >> cdr_spec_out, k, '\t', v, '\t', y

cdr_spec_out.close()
The output I want to generate, using the example above, removes the duplicates in col 1 and col 2 but keeps all matching values from col 3 on one line:
ADVAEDY 28 0.17805, 0.17365, 0.16951
...
ARYLGYNSNWYPFDY 23 4.16148, 3.17716, 1.74919
...
ARHLGYNSAWYPFDY 21 10.6038, 2.3498, 1.68818
...
AGIAFDY 20 0.457553, 0.416321, 0.286349
...
ATIEDH 4 2.45283, 0.553351, 0.441266
Also, the comma-separated values in the "new" col 3 need to be ordered from largest to smallest. I would prefer to stay away from modules, as I'm still learning Python and the "pythonic" way of doing things.
Any help is appreciated.
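What causes the same AA sequence to be printed additional times is the second for loop, which writes one line for every matching row in cdr_spec_list:
    for x,y in cdr_spec_list:
Try loading the contents of the cdr_spec file into a dictionary from the start, instead of a list: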
from collections import defaultdict

with open(cdr_spec, "r") as f2:
    cdr_spec_dic = defaultdict(list)  # a dictionary whose default value is an empty list
    for ln in f2:
        k, v = ln.strip().split()
        cdr_spec_dic[k].append(v)
Now you have a dictionary mapping each AA sequence to the list of its numerical values. The second for loop is no longer needed, and the values can be sorted as each line is written:
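for k, v in countermc_cdr:
    # key=float sorts numerically (the values are stored as strings);
    # reverse=True puts the largest value first
    values = sorted(cdr_spec_dic[k], key=float, reverse=True)
    print >> cdr_spec_out, k, '\t', v, '\t', ', '.join(values)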
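For reference, the whole approach can be sketched as one small function. This is only an illustration under a few assumptions: it uses Python 3 syntax rather than the Python 2 `print >>` form above, the function name `merge_lines` and the inline sample data are made up for the example, and a plain dict with `setdefault` stands in for `defaultdict` since the question asks to keep imports to a minimum. The values are sorted with `key=float` because they are read as strings, and as strings "10.6038" would sort before "2.3498".

```python
from collections import Counter


def merge_lines(cdr, spec_rows):
    """Return one tab-separated line per unique sequence.

    cdr       -- list of sequences, one entry per occurrence
    spec_rows -- list of (sequence, value-as-string) pairs
    """
    # Group every value under its sequence (plain-dict version of defaultdict(list)).
    values_by_seq = {}
    for seq, value in spec_rows:
        values_by_seq.setdefault(seq, []).append(value)

    lines = []
    # most_common() yields (sequence, count) pairs, highest count first.
    for seq, count in Counter(cdr).most_common():
        # Sort numerically, largest first; the strings themselves are kept unchanged.
        ordered = sorted(values_by_seq.get(seq, []), key=float, reverse=True)
        lines.append("%s\t%d\t%s" % (seq, count, ", ".join(ordered)))
    return lines


# Hypothetical sample data, loosely based on the rows in the question.
cdr = ["ARHLGYNSAWYPFDY"] * 3 + ["ATIEDH"] * 2
spec_rows = [
    ("ARHLGYNSAWYPFDY", "2.3498"),
    ("ARHLGYNSAWYPFDY", "10.6038"),
    ("ARHLGYNSAWYPFDY", "1.68818"),
    ("ATIEDH", "0.553351"),
    ("ATIEDH", "2.45283"),
]

for line in merge_lines(cdr, spec_rows):
    print(line)
```

In a real script the two lists would come from the input files, and each returned line would be written to the output file instead of printed.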