Retrieving the top value in a dictionary that has multiple values under a single key
I am somewhat new to Python and I have a problem. I have a file with 5 results for each unique identifier. Each result has a percent match and various other pieces of data. My goal is to find the result with the greatest percent match, and then retrieve more information from that original line. For example:
Name Organism Percent Match Misc info
1 Human 100 xxx
1 Goat 95 yyy
1 Pig 90 zzz
I am attempting to solve this problem by building a dictionary keyed by name, where the values are the percent matches for that name (i.e. multiple values for every key). The only way I can think to proceed is to convert the values in this dictionary to a list, then sort the list. I then want to retrieve the greatest value in the list (list[0] or list[-1]) and then retrieve more info from the original line. Here is my code thus far:
list = []
if "1" in line:
    id = line
    bsp = id.split("\t")
    uid = bsp[0]
    per = bsp[2]
    if not dict.has_key(uid):
        dict[uid] = []
    dict[uid].append(per)
    list = dict[uid]
    list.sort()
if list[0] in dict:
    print key
This ends up just printing every key, as opposed to only the one with the greatest percent. Any thoughts? Thanks!
You could use csv to parse the tab-delimited data file (though the data you posted looks to be column-spaced data!?).
Since the first line in your data file gives field names, a DictReader is convenient, so you can refer to the columns by human-readable names.
csv.DictReader returns an iterable of rows (dicts). If you take the max of the iterable using the Percent Match column as the key, you can find the row with the highest percent match.
Using this (tab-delimited) data as test.dat:
Name Organism Percent Match Misc info
1 Human 100 xxx
1 Goat 95 yyy
1 Pig 90 zzz
2 Mouse 95 yyy
2 Moose 90 zzz
2 Manatee 100 xxx
the program
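import csv
maxrows = {}
# 'rb' is the Python 2 idiom; on Python 3 use open('test.dat', newline='') instead
with open('test.dat', 'rb') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        name = row['Name']
        percent = int(row['Percent Match'])
        # keep this row if it ties or beats the best row seen so far for this name
        if int(maxrows.get(name, row)['Percent Match']) <= percent:
            maxrows[name] = row
print(maxrows)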
yields
{'1': {'info': None, 'Percent Match': '100', 'Misc': 'xxx', 'Organism': 'Human', 'Name': '1'}, '2': {'info': None, 'Percent Match': '100', 'Misc': 'xxx', 'Organism': 'Manatee', 'Name': '2'}}
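For what it's worth, here is a rough Python 3 sketch of the same per-name bookkeeping. It is not the code above, just one way it might look: `DATA` is a hypothetical in-memory copy of test.dat read through `io.StringIO` so the sketch runs without a file, `best_rows` is an illustrative helper name, and the header is assumed to have a single tab-separated "Misc info" column.

```python
import csv
import io

# Hypothetical in-memory copy of test.dat, so the sketch runs without a file.
DATA = (
    "Name\tOrganism\tPercent Match\tMisc info\n"
    "1\tHuman\t100\txxx\n"
    "1\tGoat\t95\tyyy\n"
    "1\tPig\t90\tzzz\n"
    "2\tMouse\t95\tyyy\n"
    "2\tMoose\t90\tzzz\n"
    "2\tManatee\t100\txxx\n"
)


def best_rows(f):
    """Return {name: row}, keeping the row with the highest Percent Match per name."""
    maxrows = {}
    for row in csv.DictReader(f, delimiter='\t'):
        name = row['Name']
        best = maxrows.get(name)
        # >= means a later row wins a tie, like the <= test in the code above
        if best is None or int(row['Percent Match']) >= int(best['Percent Match']):
            maxrows[name] = row
    return maxrows


maxrows = best_rows(io.StringIO(DATA))
for name, row in maxrows.items():
    # the whole original row is still available, not just the percent
    print(name, row['Organism'], row['Percent Match'], row['Misc info'])
```

With a real file on Python 3, you would pass `open('test.dat', newline='')` to the helper instead of the `StringIO` object.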
You should be able to do something like this:
lines = []
with open('data.txt') as file:
    for line in file:
        if line.startswith('1'):
            lines.append(line.split())
best_match = max(lines, key=lambda k: int(k[2]))
After reading the file, lines would look something like this:
>>> pprint.pprint(lines)
[['1', 'Human', '100', 'xxx'],
 ['1', 'Goat', '95', 'yyy'],
 ['1', 'Pig', '90', 'zzz']]
And then you want to get the entry from lines where the int value of the third item is the highest, which can be expressed like this:
>>> max(lines, key=lambda k: int(k[2]))
['1', 'Human', '100', 'xxx']
So at the end of this, best_match will be a list with the data from the line you are interested in.
Or, if you wanted to get really tricky, you could get the line in one (complicated) step:
with open('data.txt') as file:
    best_match = max((s.split() for s in file if s.startswith('1')),
                     key=lambda k: int(k[2]))
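Two caveats with this form, sketched below under some assumptions: `startswith('1')` also matches ids such as 10 or 11, and `max` raises ValueError when nothing matches. Comparing the first field exactly and passing `default=None` (available in Python 3.4+) is one possible way around both. `best_match_for` and the sample rows are hypothetical, made up for illustration.

```python
def best_match_for(lines, uid):
    """Return the split fields of the best line for uid, or None if there is none."""
    rows = (s.split() for s in lines)
    # compare the whole first field, so '1' does not also match '10'
    candidates = (r for r in rows if r and r[0] == uid)
    return max(candidates, key=lambda r: int(r[2]), default=None)


sample = [
    "Name Organism Percent Match Misc info",
    "1 Human 100 xxx",
    "1 Goat 95 yyy",
    "10 Cow 99 qqq",  # hypothetical row: starts with '1' but is a different id
]

print(best_match_for(sample, "1"))
print(best_match_for(sample, "3"))
```

An open file object can be passed in place of `sample`, since both are just iterables of lines.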
I think you may be looking for something like:
from collections import defaultdict
results = defaultdict(list)
with open('data.txt') as f:
    next(f)  # skip the header line; remove this if your file has no header
    for line in f:
        splitted = line.split()
        # group the remaining fields of each line under its id
        results[splitted[0]].append(splitted[1:])
maxs = {}
for uid, data in results.items():
    # data[i] is [organism, percent, misc], so k[1] is the percent
    maxs[uid] = max(data, key=lambda k: int(k[1]))
I've tested it on a file like:
Name Organism Percent Match Misc info
1 Human 100 xxx
1 Goat 95 yyy
1 Pig 90 zzz
2 Pig 85 zzz
2 Goat 70 yyy
And the result was:
{'1': ['Human', '100', 'xxx'], '2': ['Pig', '85', 'zzz']}
with open('datafile.txt', 'r') as f:
    lines = f.read().split('\n')  # was file.read(), but the handle is named f
matchDict = {}
for line in lines:
    # startswith avoids an IndexError on the empty string after the final newline
    if line.startswith('1'):
        uid, organism, percent, misc = line.split('\t')
        matchDict[int(percent)] = (organism, uid, misc)
highestMatch = max(matchDict.keys())
print('{0} is the highest match at {1} percent'.format(matchDict[highestMatch][0], highestMatch))
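One thing to watch with keying the dictionary by percent: two results with the same percent overwrite each other, because dictionary keys are unique. A small sketch of the effect, using a hypothetical tied row ("Sheep") that is not in the original data, plus a variant that takes max over the rows themselves:

```python
rows = [
    ("1", "Human", "100", "xxx"),
    ("1", "Goat", "95", "yyy"),
    ("1", "Sheep", "95", "www"),  # hypothetical row that ties with Goat
]

by_percent = {}
for uid, organism, percent, misc in rows:
    by_percent[int(percent)] = (organism, uid, misc)

# Goat has been overwritten by Sheep: only two entries remain
print(len(by_percent), by_percent[95][0])

# Taking max over the rows keeps every row and still finds the top one
best = max(rows, key=lambda r: int(r[2]))
print('{0} is the highest match at {1} percent'.format(best[1], best[2]))
```

For the highest match itself this only matters when the top percent is tied, but it does mean the dictionary is not a complete record of the file.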