
Given a dictionary of dictionaries where keys are bins and values are frequencies, how do I efficiently calculate the mean and std of the bins?

I have a dictionary of dictionaries where the keys of the inner dictionaries represent bins of a histogram and the values represent the frequency. I want to calculate the mean bin and the std of the bins.

dict = {'Group 1' : {1 : 100, 2:300, 4:100, 5:50},
       {'Group 2' : {1 : 50, 2: 300},
       {'Group 3' : {4 : 100, 5: 200},
        ...}

Example: For Group 1 I want to get the mean and std identical to taking the mean and std of a list of 100 1's, 300 2's, 100 4's, and 50 5's:

l = []
l.extend([1 for j in range(0,100)])
l.extend([2 for j in range(0,300)])
l.extend([4 for j in range(0,100)])
l.extend([5 for j in range(0,50)])
np.mean(l)  # 2.45
np.std(l)   # 1.23

What would be the best way to iterate over each dictionary and transform it such that I get a dictionary of dictionaries representing the mean and std of the bins of the inner dictionaries?

transformed_dictionary = {'Group 1' : {'mean': 2.45 , 'std' : 1.23},
                          'Group 2' : {...},
                           ...}
   

What could be an efficient way of doing this?

First, you should not name your dictionary dict, which will mask the builtin class named dict. Second, your declaration of dict is not quite correct (and it is not a "dictionary of dictionaries" -- it is a dictionary whose values are dictionaries).
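
For example (a minimal sketch of the shadowing problem, using an illustrative placeholder value):

dict = {'Group 1': {1: 100}}   # rebinds the name, hiding the builtin dict class
# dict([('a', 1)])             # would now raise TypeError: 'dict' object is not callable
del dict                       # removing the binding makes the builtin reachable again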

import numpy as np

d = {'Group 1' : {1 : 100, 2:300, 4:100, 5:50},
       'Group 2' : {1 : 50, 2: 300},
       'Group 3' : {4 : 100, 5: 200}
       }

transformed_dictionary = {}
for k, v in d.items():
    l = []
    # expand each bin value into `count` explicit samples
    for value, count in v.items():
        l.extend([value] * count)
    transformed_dictionary[k] = {'mean': np.mean(l), 'std': np.std(l)}
print(transformed_dictionary)

Prints:

{'Group 1': {'mean': 2.4545454545454546, 'std': 1.233150906022776}, 'Group 2': {'mean': 1.8571428571428572, 'std': 0.34992710611188255}, 'Group 3': {'mean': 4.666666666666667, 'std': 0.4714045207910316}}

To avoid building an auxiliary list, you can use np.average with the weights= parameter:

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    values, weights -- array-likes (e.g. NumPy ndarrays) with the same shape.
    """
    average = np.average(values, weights=weights)
    # Fast and numerically precise:
    variance = np.average((values-average)**2, weights=weights)
    return average, np.sqrt(variance)

d = {'Group 1' : {1 : 100, 2:300, 4:100, 5:50},
     'Group 2' : {1 : 50, 2: 300},
     'Group 3' : {4 : 100, 5: 200}}

out = {}
for k, v in d.items():
    # [*v] unpacks the bin values (the dict keys); [*v.values()] gives their frequencies
    m, s = weighted_avg_and_std([*v], [*v.values()])
    out[k] = {
        'mean': m,
        'std': s
    }

print(out)

Prints:

{'Group 1': {'mean': 2.4545454545454546, 'std': 1.2331509060227759}, 
 'Group 2': {'mean': 1.8571428571428572, 'std': 0.3499271061118826}, 
 'Group 3': {'mean': 4.666666666666667, 'std': 0.4714045207910317}}
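
As a quick sanity check (a sketch using np.repeat on the 'Group 1' data, which expands each bin value by its frequency), the weighted results match the brute-force expansion:

import numpy as np

group = {1: 100, 2: 300, 4: 100, 5: 50}                    # 'Group 1' bins and frequencies
expanded = np.repeat(list(group), list(group.values()))    # explicit 550-element sample
print(np.mean(expanded), np.std(expanded))                 # 2.4545454545454546 1.233150906022776

np.repeat keeps the expansion inside NumPy, so it is a convenient way to verify the weighted computation without writing the Python loops from the first approach.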
