
Python: Removing duplicates from col 1,2 and printing col 3 values on 1 line

I have a file with AA sequences in column 1, and in column two, the number of times they appear, which I created using Counter(). In column three I have numerical values, which are all different. The items in col 1 and col 2 can be identical.

Ex. Input file:

ADVAEDY         28      0.17805
ADVAEDY         28      0.17365
ADVAEDY         28      0.16951
...
ARYLGYNSNWYPFDY         23      4.16148
ARYLGYNSNWYPFDY         23      3.17716
ARYLGYNSNWYPFDY         23      1.74919
...
ARHLGYNSAWYPFDY         21      10.6038
ARHLGYNSAWYPFDY         21      2.3498
ARHLGYNSAWYPFDY         21      1.68818
...
AGIAFDY         20      0.457553
AGIAFDY         20      0.416321
AGIAFDY         20      0.286349
...
ATIEDH  4       2.45283
ATIEDH  4       0.553351
ATIEDH  4       0.441266

There are 197 lines in this file, but only 48 unique AA sequences in col 1. The code that generated this file:

import sys
from collections import Counter

input_fh = sys.argv[1] # File containing all CDR(x)
cdr_spec = sys.argv[2] # File containing CDR(x) in one column and specificities in the second

with open(input_fh, "r") as f1:
        cdr = [line.strip() for line in f1]

with open(cdr_spec, "r") as f2:
        cdr_spec_list = [line.strip().split() for line in f2]

cdr_spec_out = open("CDR" + c + "_counts_spec.txt", "w")

counter_cdr = Counter(cdr)
countermc_cdr = counter_cdr.most_common()

print len(countermc_cdr)
#This one might work:
for k,v in countermc_cdr:
        for x,y in cdr_spec_list:
                if k == x:
                        print >> cdr_spec_out, k, '\t', v, '\t', y

cdr_spec_out.close()

The output I want to generate, using the example above, removes the duplicates in col 1 and col 2 but keeps all matching col 3 values on one line:

ADVAEDY         28      0.17805, 0.17365, 0.16951
...
ARYLGYNSNWYPFDY         23      4.16148, 3.17716, 1.74919
...
ARHLGYNSAWYPFDY         21      10.6038, 2.3498, 1.68818
...
AGIAFDY         20      0.457553, 0.416321, 0.286349
...
ATIEDH  4       2.45283, 0.553351, 0.441266

Also, the comma-separated values in the "new" col 3 need to be ordered from largest to smallest. I would prefer to stay away from modules, as I'm still learning Python and the "pythonic" way of doing things.

Any help is appreciated.

What causes the same AA to be printed additional times is the second for loop:

    for x,y in cdr_spec_list:

Try loading the contents of cdr_spec into a dictionary from the start, instead of into cdr_spec_list:

from collections import defaultdict

with open(cdr_spec, "r") as f2:
    cdr_spec_dic = defaultdict(list) # a dictionary whose missing keys default to an empty list
    for ln in f2:
        k,v = ln.strip().split()
        cdr_spec_dic[k].append(v)
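
Since the question asks to avoid extra modules, the same grouping can also be done with a plain dict and `setdefault`. This is only a minimal sketch: the function name `group_values` and the sample lines are made up for illustration, and it assumes every useful line has exactly two whitespace-separated fields.

```python
def group_values(lines):
    """Group the second field of each line under the first field."""
    groups = {}
    for ln in lines:
        parts = ln.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines (an assumption about the input)
        key, value = parts
        # setdefault returns the existing list, or stores and returns a new one
        groups.setdefault(key, []).append(value)
    return groups


# Hypothetical sample lines in the "sequence value" layout of the cdr_spec file
sample = ["ADVAEDY 0.17805", "ADVAEDY 0.17365", "ATIEDH 2.45283"]
grouped = group_values(sample)
print(grouped)
```

The values are kept as strings here, exactly as they were read from the file.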

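Now you have a dictionary that maps each AA sequence to the list of its numerical values, so the second for loop is no longer needed and the values can be sorted while printing. The values were read as strings, so sort them with key=float (a plain string sort would put "10.6038" before "2.3498") and pass reverse=True to get largest to smallest:

for k,v in countermc_cdr:
    values = sorted(cdr_spec_dic[k], key=float, reverse=True)
    print >> cdr_spec_out, k, '\t', v, '\t', ', '.join(values)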
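
Putting the pieces together, the whole approach might look like the sketch below. It works on in-memory lists rather than files so it can be run as is; `merge_rows` and the sample data are hypothetical, and the code is written to run on Python 3 as well as Python 2, unlike the `print >>` statements above.

```python
from collections import Counter, defaultdict


def merge_rows(cdr, cdr_spec_rows):
    """Return one tab-separated line per sequence: sequence, count, sorted values."""
    counts = Counter(cdr)
    values = defaultdict(list)
    for seq, val in cdr_spec_rows:
        values[seq].append(val)

    lines = []
    for seq, n in counts.most_common():
        # Compare as numbers, largest first; the original strings are kept for output
        ordered = sorted(values[seq], key=float, reverse=True)
        lines.append("%s\t%d\t%s" % (seq, n, ", ".join(ordered)))
    return lines


# Hypothetical sample data shaped like the two input files
cdr = ["ATIEDH"] * 4 + ["AGIAFDY"] * 20
rows = [
    ("ATIEDH", "0.553351"),
    ("ATIEDH", "2.45283"),
    ("AGIAFDY", "0.286349"),
    ("AGIAFDY", "0.457553"),
    ("ATIEDH", "0.441266"),
]
merged_lines = merge_rows(cdr, rows)
for line in merged_lines:
    print(line)
```

Building each line with a single format string also avoids the extra spaces that `print` with comma-separated arguments places around each tab.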

