Extract string showing maximum value in a list of tuples with duplicated elements

Question

I have the following simplified data structure:

input = [("FileName1", "ID1", "Sequence1", 1000),
         ("FileName1", "ID1", "Sequence2", 500),
         ("FileName1", "ID2", "Sequence3", 1500),
         ("FileName1", "ID2", "Sequence5", 200),
         ("FileName2", "ID1", "Sequence1", 500),
         ("FileName2", "ID1", "Sequence2", 1000),
         ("FileName2", "ID2", "Sequence3", 250),
         ("FileName2", "ID2", "Sequence5", 2000)]

Here, a specific ID can be linked to several Sequences (the number of Sequences per ID varies), and several IDs can be linked to one FileName (the number of IDs per FileName also varies).

What I would like is to extract the FileName/ID/Sequence triplet with the maximum intensity for each ID within each FileName:

Output:

output = [("FileName1", "ID1", "Sequence1"),
          ("FileName1", "ID2", "Sequence3"),
          ("FileName2", "ID1", "Sequence2"),
          ("FileName2", "ID2", "Sequence5")]

In the end I need one unique sequence (the one with the maximum value) for each ID, together with its FileName, because I need all of this information to map it to a dataframe afterwards.

After this step, no FileName should contain a duplicate ID, and each ID should be linked to exactly one sequence.

Thanks for your help

Answer 1

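Using itertools.groupby. Note that it only groups consecutive items, so the list must already be ordered by (FileName, ID), as it is here.

Example: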

import itertools

input = [("FileName1", "ID1", "Sequence1", 1000),
         ("FileName1", "ID1", "Sequence2", 500),
         ("FileName1", "ID2", "Sequence3", 1500),
         ("FileName1", "ID2", "Sequence5", 200),
         ("FileName2", "ID1", "Sequence1", 500),
         ("FileName2", "ID1", "Sequence2", 1000),
         ("FileName2", "ID2", "Sequence3", 250),
         ("FileName2", "ID2", "Sequence5", 2000)]


# Group consecutive rows by (FileName, ID) and keep the row with the highest intensity.
result = []
for k, v in itertools.groupby(input, key=lambda x: (x[0], x[1])):
    result.append(max(v, key=lambda x: x[-1]))

# OR, as a list comprehension:
# result = [max(v, key=lambda x: x[-1]) for k, v in itertools.groupby(input, key=lambda x: (x[0], x[1]))]

print(result)

Output

[('FileName1', 'ID1', 'Sequence1', 1000),
 ('FileName1', 'ID2', 'Sequence3', 1500),
 ('FileName2', 'ID1', 'Sequence2', 1000),
 ('FileName2', 'ID2', 'Sequence5', 2000)]
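
If the rows are not guaranteed to be ordered by (FileName, ID), or if you want the bare triplets from the question rather than the full rows, a plain dictionary may be simpler. This is only a sketch: the names rows and best_sequences are made up for illustration, and it assumes the intensity is always the fourth element of each tuple. On ties it keeps the first row it sees.

```python
# Rows copied from the question: (file name, ID, sequence, intensity).
rows = [
    ("FileName1", "ID1", "Sequence1", 1000),
    ("FileName1", "ID1", "Sequence2", 500),
    ("FileName1", "ID2", "Sequence3", 1500),
    ("FileName1", "ID2", "Sequence5", 200),
    ("FileName2", "ID1", "Sequence1", 500),
    ("FileName2", "ID1", "Sequence2", 1000),
    ("FileName2", "ID2", "Sequence3", 250),
    ("FileName2", "ID2", "Sequence5", 2000),
]


def best_sequences(rows):
    """Return one (file, id, sequence) triplet per (file, id) pair,
    keeping the row with the highest intensity (hypothetical helper)."""
    best = {}
    for row in rows:
        key = (row[0], row[1])
        # Keep the first row seen for a key, then replace it only
        # when a strictly higher intensity shows up.
        if key not in best or row[3] > best[key][3]:
            best[key] = row
    # Drop the intensity so only the triplet remains.
    return [row[:3] for row in best.values()]


triplets = best_sequences(rows)
print(triplets)
```

Because the dictionary is keyed on (FileName, ID), the input order does not affect which sequence is picked, only the order of the resulting list (dictionaries keep insertion order in Python 3.7+).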
