Python approximate group-by

I want to group the keys of a dict by their values. However, the values are only approximately equal. What's the best approach to doing a group-by in this scenario? I have:

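from collections import defaultdict

buckets = defaultdict(list)
for k, v in my_dict.items():
    # Reuse the first existing bucket whose representative is within 1e-3 of v.
    closest = next((rep for rep in buckets if abs(rep - v) < 1e-3), None)
    if closest is not None:  # a plain truth test would wrongly skip a 0.0 representative
        buckets[closest].append(k)
    else:
        buckets[v].append(k)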

Is there any itertools magic or other trick that could simplify this or make it more pythonic, or is this the best I can do?

Your algorithm is O(n**2) in the worst case (when most values end up in their own bucket), since it performs O(n) operations inside an O(n) loop:

for k, v in my_dict.items():
    closest = next((rep for rep in buckets if abs(rep - v) < 1e-3), None)
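To make the quadratic growth concrete, here is a minimal sketch that counts how many abs(rep - v) checks the linear scan performs. The helper name count_comparisons and the test inputs are made up for illustration; only the scanning logic mirrors the loop above.

```python
def count_comparisons(values, tol=1e-3):
    """Count the abs(rep - v) checks made by the linear-scan bucketing."""
    reps = []          # bucket representatives seen so far
    comparisons = 0
    for v in values:
        for rep in reps:
            comparisons += 1
            if abs(rep - v) < tol:
                break              # found a close bucket, stop scanning
        else:
            reps.append(v)         # no representative was close enough

    return comparisons

# Worst case: every value is far from all the others, so each one
# is compared against every bucket created before it.
for n in (10, 100, 1000):
    print(n, count_comparisons(range(n)))
# 10 45
# 100 4950
# 1000 499500
```

The counts follow n*(n-1)/2, which is the O(n**2) behaviour described above.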

You could make it O(n log n) by sorting my_dict.items() by value and then looping over the sorted items. Notice that if buckets is an OrderedDict, you no longer need the for rep in buckets scan: the items are inserted in ascending order of value, so the keys of the OrderedDict are in sorted order too. That means if the next value is close to any bucket, it has to be close to the last bucket. Thus you do not need to loop over all the buckets. Just compare with the last one:

import random
random.seed(123)
N = 10
my_dict = dict(zip(range(N), [random.randint(0, 10)/10.0 for k in range(N)]))
print(my_dict)
# {0: 0.0, 1: 0.0, 2: 0.4, 3: 0.1, 4: 0.9, 5: 0.0, 6: 0.5, 7: 0.3, 8: 0.9, 9: 0.1}
# (as printed in the original answer; the values drawn can differ between Python versions)

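import operator
import collections
# Sort the (key, value) pairs by value so that close values become neighbours.
items = sorted(my_dict.items(), key=operator.itemgetter(1))
# Seed the buckets with the smallest value as the first representative.
buckets = collections.OrderedDict([(items[0][1], [items[0][0]])])
for k, v in items[1:]:
    # The most recently added bucket has the largest representative so far.
    last_val = next(reversed(buckets))
    closest = last_val if abs(last_val - v) < 1e-3 else v
    buckets.setdefault(closest, []).append(k)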

print(buckets)

prints

OrderedDict([(0.0, [0, 1, 5]), (0.1, [3, 9]), (0.3, [7]), (0.4, [2]), (0.5, [6]), (0.9, [4, 8])])
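The same "sort, then compare with the last bucket only" idea can also be sketched as a small reusable function that keeps the buckets in a plain list. This is only an illustration under assumptions: the function name group_approx, the tol parameter, and the sample dict are hypothetical, not part of the original answer.

```python
from operator import itemgetter

def group_approx(mapping, tol=1e-3):
    """Group keys whose values lie within tol of a bucket's first (smallest) value."""
    groups = []  # list of (representative_value, [keys]) in ascending order
    for key, value in sorted(mapping.items(), key=itemgetter(1)):
        # Values arrive in ascending order, so only the last bucket can match.
        if groups and abs(value - groups[-1][0]) < tol:
            groups[-1][1].append(key)
        else:
            groups.append((value, [key]))
    return groups

example = {"a": 0.5, "b": 0.5004, "c": 0.9, "d": 0.1, "e": 0.9002}
print(group_approx(example))
# [(0.1, ['d']), (0.5, ['a', 'b']), (0.9, ['c', 'e'])]
```

As in the OrderedDict version, each bucket is represented by the first (smallest) value that opened it, so a bucket never spans more than tol above its representative.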

This would be slightly more "pythonic":

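from collections import defaultdict

buckets = defaultdict(list)
for k, v in my_dict.items():
    try:
        # next() without a default raises StopIteration when no bucket is close enough.
        closest = next(rep for rep in buckets if abs(rep - v) < 1e-3)
        buckets[closest].append(k)
    except StopIteration:
        buckets[v].append(k)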

Aside from being inefficient, your code does not guarantee the same result, or any particular result, on every run, since the outcome depends on the iteration order of the dict, which may be arbitrary. To solve both problems you can simply use a key function:

key = lambda x: round(x, 3)

Then you group the usual way, but using key(v) as the index:

from collections import defaultdict

buckets = defaultdict(list)
for k, v in my_dict.items():
    buckets[key(v)].append(k)
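One caveat with a rounding key: it groups by fixed grid cells, not by distance. Two values that are very close but sit on opposite sides of a rounding boundary land in different buckets, while two values further apart can share one. A minimal sketch of this, where the helper name group_by_rounding and the sample values are made up for illustration:

```python
from collections import defaultdict

def group_by_rounding(mapping, ndigits=3):
    """Group keys by their value rounded to ndigits decimal places."""
    buckets = defaultdict(list)
    for k, v in mapping.items():
        buckets[round(v, ndigits)].append(k)
    return dict(buckets)

# x and y differ by only 0.0002 but round to different cells;
# y and z differ by 0.0008 yet share a cell.
sample = {"x": 0.0014, "y": 0.0016, "z": 0.0024}
print(group_by_rounding(sample))
# {0.001: ['x'], 0.002: ['y', 'z']}
```

Whether this matters depends on the data: if the values cluster tightly around well-separated points, rounding is the simplest option; if they can straddle a boundary, the sorted approach may be the safer choice.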
