I have the following simplified data structure:
input = [("FileName1", "ID1", "Sequence1", 1000),
("FileName1", "ID1", "Sequence2", 500),
("FileName1", "ID2", "Sequence3", 1500),
("FileName1", "ID2", "Sequence5", 200),
("FileName2", "ID1", "Sequence1", 500),
("FileName2", "ID1", "Sequence2", 1000),
("FileName2", "ID2", "Sequence3", 250),
("FileName2", "ID2", "Sequence5", 2000)]
Here, a specific ID can be linked to several Sequences (the number of Sequences per ID varies), and several IDs can be linked to one FileName (the number of IDs per FileName varies).
What I would like is to extract the FileName/ID/Sequence triplet with the maximum intensity (the last value in each tuple) for each ID within each FileName:
Output:
output = [("FileName1", "ID1", "Sequence1"),
("FileName1", "ID2", "Sequence3"),
("FileName2", "ID1", "Sequence2"),
("FileName2", "ID2", "Sequence5")]
In the end, I need one unique Sequence (the one with the maximum value) for each ID, together with its FileName, because I need all of this information to map onto a dataframe afterwards.
After this step, a FileName will no longer contain duplicate IDs, and exactly one Sequence will be linked to each ID.
Thanks for your help
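Using itertools.groupby. Note that groupby only groups consecutive items, so the rows must already be ordered by FileName and ID, as they are here.
Ex: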
import itertools
input = [("FileName1", "ID1", "Sequence1", 1000),
("FileName1", "ID1", "Sequence2", 500),
("FileName1", "ID2", "Sequence3", 1500),
("FileName1", "ID2", "Sequence5", 200),
("FileName2", "ID1", "Sequence1", 500),
("FileName2", "ID1", "Sequence2", 1000),
("FileName2", "ID2", "Sequence3", 250),
("FileName2", "ID2", "Sequence5", 2000)]
result = []
for k, v in itertools.groupby(input, lambda x: (x[0], x[1])):
    # keep the row with the highest intensity (last element) in each group
    result.append(max(v, key=lambda x: x[-1]))
# OR
# result = [max(v, key=lambda x: x[-1]) for k, v in itertools.groupby(input, lambda x: (x[0], x[1]))]
print(result)
Output
[('FileName1', 'ID1', 'Sequence1', 1000),
('FileName1', 'ID2', 'Sequence3', 1500),
('FileName2', 'ID1', 'Sequence2', 1000),
('FileName2', 'ID2', 'Sequence5', 2000)]
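If the rows may arrive in any order, a plain dictionary keyed by (FileName, ID) works without sorting first. This is only a minimal sketch, assuming the intensity is always the fourth element of each tuple. The names best_sequences and rows are made up for illustration. It also drops the intensity at the end, which gives the triplets asked for in the question:

```python
def best_sequences(rows):
    """Return one (file_name, id, sequence) triplet per (file_name, id) pair,
    keeping the sequence with the highest intensity."""
    best = {}  # (file_name, id) -> full row with the highest intensity seen so far
    for row in rows:
        key = (row[0], row[1])
        # strict ">" means the first row wins when intensities are tied
        if key not in best or row[3] > best[key][3]:
            best[key] = row
    # drop the intensity to get FileName/ID/Sequence triplets
    return [row[:3] for row in best.values()]


# same data as above, deliberately shuffled
rows = [
    ("FileName2", "ID2", "Sequence5", 2000),
    ("FileName1", "ID1", "Sequence1", 1000),
    ("FileName2", "ID1", "Sequence1", 500),
    ("FileName1", "ID2", "Sequence5", 200),
    ("FileName1", "ID1", "Sequence2", 500),
    ("FileName2", "ID1", "Sequence2", 1000),
    ("FileName1", "ID2", "Sequence3", 1500),
    ("FileName2", "ID2", "Sequence3", 250),
]

triplets = sorted(best_sequences(rows))
print(triplets)
```

Sorting the result is optional; it is done here only so the printed order matches the expected output in the question.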