
Given a dictionary of dictionaries where keys are bins and values are frequencies, how do I efficiently calculate the mean and std of the bins?

I have a dictionary of dictionaries where the keys of the inner dictionaries represent bins of a histogram and the values represent the frequency. I want to calculate the mean bin and the std of the bins.

dict = {'Group 1' : {1 : 100, 2:300, 4:100, 5:50},
       {'Group 2' : {1 : 50, 2: 300},
       {'Group 3' : {4 : 100, 5: 200},
        ...}

Example: For Group 1 I want to get the mean and std identical to taking the mean and std of a list of 100 1's, 300 2's, 100 4's, and 50 5's:

l = []
l.extend([1 for j in range(0,100)])
l.extend([2 for j in range(0,300)])
l.extend([4 for j in range(0,100)])
l.extend([5 for j in range(0,50)])
np.mean(l)  # 2.45
np.std(l)   # 1.23

What would be the best way to iterate over each dictionary and transform it such that I get a dictionary of dictionaries representing the mean and std of the bins of the inner dictionaries?

transformed_dictionary = {'Group 1' : {'mean': 2.45 , 'std' : 1.23},
                          'Group 2' : {...},
                           ...}
   

What could be an efficient way of doing this?

First, you should not name your dictionary dict, which will mask the builtin class named dict. Second, your declaration of dict is not quite correct (and it is not a "dictionary of dictionaries" -- it is a dictionary whose values are dictionaries).
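
For example (a minimal sketch of the shadowing problem, using an illustrative placeholder value):

dict = {'Group 1': {1: 100}}   # rebinds the name, hiding the builtin dict class
# dict([('a', 1)])             # would now raise TypeError: 'dict' object is not callable
del dict                       # removing the binding makes the builtin reachable again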

import numpy as np

d = {'Group 1' : {1 : 100, 2:300, 4:100, 5:50},
       'Group 2' : {1 : 50, 2: 300},
       'Group 3' : {4 : 100, 5: 200}
       }

transformed_dictionary = {}
for k, v in d.items():
    l = []
    # expand each bin value into `count` explicit samples
    for value, count in v.items():
        l.extend([value] * count)
    transformed_dictionary[k] = {'mean': np.mean(l), 'std': np.std(l)}
print(transformed_dictionary)

Prints:

{'Group 1': {'mean': 2.4545454545454546, 'std': 1.233150906022776}, 'Group 2': {'mean': 1.8571428571428572, 'std': 0.34992710611188255}, 'Group 3': {'mean': 4.666666666666667, 'std': 0.4714045207910316}}

To avoid building an auxiliary list, you can use np.average with the weights= parameter:

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    values, weights -- array-likes (e.g. NumPy ndarrays) with the same shape.
    """
    average = np.average(values, weights=weights)
    # Fast and numerically precise:
    variance = np.average((values-average)**2, weights=weights)
    return average, np.sqrt(variance)

d = {'Group 1' : {1 : 100, 2:300, 4:100, 5:50},
     'Group 2' : {1 : 50, 2: 300},
     'Group 3' : {4 : 100, 5: 200}}

out = {}
for k, v in d.items():
    # [*v] unpacks the bin values (the dict keys); [*v.values()] gives their frequencies
    m, s = weighted_avg_and_std([*v], [*v.values()])
    out[k] = {
        'mean': m,
        'std': s
    }

print(out)

Prints:

{'Group 1': {'mean': 2.4545454545454546, 'std': 1.2331509060227759}, 
 'Group 2': {'mean': 1.8571428571428572, 'std': 0.3499271061118826}, 
 'Group 3': {'mean': 4.666666666666667, 'std': 0.4714045207910317}}
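
As a quick sanity check (a sketch using np.repeat on the 'Group 1' data, which expands each bin value by its frequency), the weighted results match the brute-force expansion:

import numpy as np

group = {1: 100, 2: 300, 4: 100, 5: 50}                    # 'Group 1' bins and frequencies
expanded = np.repeat(list(group), list(group.values()))    # explicit 550-element sample
print(np.mean(expanded), np.std(expanded))                 # 2.4545454545454546 1.233150906022776

np.repeat keeps the expansion inside NumPy, so it is a convenient way to verify the weighted computation without writing the Python loops from the first approach.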
