
Extract the string with the maximum value from a list of tuples with duplicated elements

I have the following simplified data structure:

input = [("FileName1", "ID1", "Sequence1", 1000),
         ("FileName1", "ID1", "Sequence2", 500),
         ("FileName1", "ID2", "Sequence3", 1500),
         ("FileName1", "ID2", "Sequence5", 200),
         ("FileName2", "ID1", "Sequence1", 500),
         ("FileName2", "ID1", "Sequence2", 1000),
         ("FileName2", "ID2", "Sequence3", 250),
         ("FileName2", "ID2", "Sequence5", 2000)]

Here, a given ID can be linked to several Sequences (the number of Sequences per ID varies), and several IDs can be linked to one FileName (the number of IDs per FileName also varies).

What I would like is to extract, for each ID within each FileName, the FileName/ID/Sequence triplet with the maximum intensity (the last element of each tuple):

Output:

output = [("FileName1", "ID1", "Sequence1"),
          ("FileName1", "ID2", "Sequence3"),
          ("FileName2", "ID1", "Sequence2"),
          ("FileName2", "ID2", "Sequence5")]

In the end I need one unique Sequence (the one with the maximum value) for each ID, together with its FileName, because I need all of this information to map it to a dataframe afterwards.

After this step, no FileName will contain a duplicate ID, and each ID will be linked to exactly one Sequence.

Thanks for your help

Using itertools

Example:

import itertools

input = [("FileName1", "ID1", "Sequence1", 1000),
         ("FileName1", "ID1", "Sequence2", 500),
         ("FileName1", "ID2", "Sequence3", 1500),
         ("FileName1", "ID2", "Sequence5", 200),
         ("FileName2", "ID1", "Sequence1", 500),
         ("FileName2", "ID1", "Sequence2", 1000),
         ("FileName2", "ID2", "Sequence3", 250),
         ("FileName2", "ID2", "Sequence5", 2000)]


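# groupby only merges consecutive items with equal keys, so the rows
# must already be ordered by (FileName, ID), as they are here.
result = []
for k, v in itertools.groupby(input, lambda x: (x[0], x[1])):
    # Keep the row with the highest intensity (last element) in each group.
    result.append(max(v, key=lambda x: x[-1]))

# OR
# result = [max(v, key=lambda x: x[-1]) for k, v in itertools.groupby(input, lambda x: (x[0], x[1]))]

print(result)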

Output

[('FileName1', 'ID1', 'Sequence1', 1000),
 ('FileName1', 'ID2', 'Sequence3', 1500),
 ('FileName2', 'ID1', 'Sequence2', 1000),
 ('FileName2', 'ID2', 'Sequence5', 2000)]
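
The result above still carries the intensity, and it depends on the rows already being grouped by FileName and ID. A minimal sketch of a variant that sorts first and returns only the triplets asked for in the question is shown below. The names `rows` and `best_triplets` are hypothetical, chosen for illustration, and the sample rows are the question's data deliberately shuffled.

```python
from itertools import groupby
from operator import itemgetter

# Same data as in the question, shuffled on purpose to show that
# the input order does not matter once the rows are sorted.
rows = [
    ("FileName2", "ID2", "Sequence5", 2000),
    ("FileName1", "ID1", "Sequence1", 1000),
    ("FileName2", "ID1", "Sequence1", 500),
    ("FileName1", "ID2", "Sequence3", 1500),
    ("FileName1", "ID1", "Sequence2", 500),
    ("FileName2", "ID2", "Sequence3", 250),
    ("FileName1", "ID2", "Sequence5", 200),
    ("FileName2", "ID1", "Sequence2", 1000),
]


def best_triplets(records):
    """Return one (FileName, ID, Sequence) triplet per (FileName, ID) pair.

    Hypothetical helper: picks the row with the highest intensity
    (index 3) in each group and drops the intensity from the result.
    """
    pair = itemgetter(0, 1)
    # groupby only groups consecutive rows, so sort by the same key first.
    ordered = sorted(records, key=pair)
    return [
        max(group, key=itemgetter(3))[:3]
        for _, group in groupby(ordered, key=pair)
    ]


triplets = best_triplets(rows)
print(triplets)
```

If two Sequences of the same ID share the maximum intensity, `max` keeps the first one it encounters in the sorted order, so a tie-breaking rule would have to be added if that case matters.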

