
Fastest way of taking the average of each element which has the same value?

I'm not quite sure how to formulate this question. I'm almost certain this will have been asked before, but I can't find it.

I have some data, like:

import numpy as np

x = np.random.rand(100) * 0.0001
y = [round(i, 1) for i in np.random.rand(100)]

They are both 100 elements long. However, y contains only about 10 unique elements. For each unique element in y, I want to take the average of all the numbers in x at the same positions.

Something like:

averageX = []
for unique in set(y):
    items = []
    for idx, value in enumerate(y):
        if value == unique:          # for each copy of this number
            items.append(x[idx])     # take the item in x at that index
    averageX.append(np.mean(items))  # and take the average

What would be the best pythonic way to do this?

So... x is some data, y is a category map of sorts mapping each index of x to a category, and you need per-category averages?

import collections
import random

x = [random.randint(0, 100) for i in range(100)]  # data
y = [random.randint(0, 10) for i in range(100)]  # categories

data_per_category = collections.defaultdict(list)

for category, datum in zip(y, x):  # iterate in parallel over both y and x
    data_per_category[category].append(datum)

for category, data in data_per_category.items():
    print(category, sum(data) / len(data))

This prints, for example:

9 51.2
5 49.0
8 56.75
1 48.166666666666664
7 45.0
0 38.42857142857143
3 50.333333333333336
4 43.7
6 45.4
10 53.0
2 44.583333333333336
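Because the data above is random, the printed averages change on every run. A minimal sketch of the same grouping idea on made-up input small enough to check by hand (the function name `mean_per_category` and the sample values are illustrative, not from the answer):

```python
import collections


def mean_per_category(categories, values):
    """Return {category: mean of the values that share that category}."""
    grouped = collections.defaultdict(list)
    for category, value in zip(categories, values):  # walk both sequences in parallel
        grouped[category].append(value)
    return {category: sum(group) / len(group) for category, group in grouped.items()}


y_small = ["a", "b", "a", "c", "b", "a"]  # made-up categories
x_small = [1.0, 2.0, 3.0, 4.0, 6.0, 5.0]  # made-up data
result = mean_per_category(y_small, x_small)
print(result)  # {'a': 3.0, 'b': 4.0, 'c': 4.0}
```

This makes a single pass over the data, so the cost grows with the length of the input rather than with the number of categories.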

If you convert the data to a pandas DataFrame, you can take advantage of groupby:

import numpy as np
import pandas as pd

x = np.random.rand(100) * 0.0001
y = [round(i, 1) for i in np.random.rand(100)]

df = pd.DataFrame({'x': x, 'y': y})  # one column per sequence
df.groupby('y').mean()               # mean of x for each unique y

#Output:
#0.0  0.000019
#0.1  0.000046
#0.2  0.000051
#0.3  0.000049
#0.4  0.000031
#0.5  0.000043
#0.6  0.000051
#0.7  0.000049
#0.8  0.000044
#0.9  0.000053
#1.0  0.000034

I'm not sure about the efficiency, but you can use masking:

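y = np.asarray(y)            # y == i must be an elementwise mask, so y cannot be a list
means = {}
for i in np.unique(y):       # visit each distinct value once
    means[i] = x[y == i].mean()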
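The loop above rescans y once per unique value. One possible alternative, which is not part of the answer, is to let np.unique assign each element an integer group label and then accumulate per-group sums and counts with np.bincount. A minimal sketch on made-up arrays (`x_small` and `y_small` are illustrative names):

```python
import numpy as np

x_small = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 5.0])  # made-up data
y_small = np.array([0.2, 0.1, 0.2, 0.3, 0.1, 0.2])  # made-up categories

# labels: sorted unique values of y; inverse: index into labels for every element of y
labels, inverse = np.unique(y_small, return_inverse=True)

sums = np.bincount(inverse, weights=x_small)  # per-group sum of x
counts = np.bincount(inverse)                 # per-group element count
means = sums / counts

print(dict(zip(labels.tolist(), means.tolist())))  # {0.1: 4.0, 0.2: 3.0, 0.3: 4.0}
```

Whether this is faster than masking depends on the data size and the number of groups, so it is worth timing on the real input.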

Another way, probably somewhat more efficient, is sorting:

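data = np.stack((x, y), axis=1)       # shape (n, 2): column 0 is x, column 1 is y
data = data[np.lexsort(data.T), :]    # lexsort treats its last key (y) as the primary key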

Now the split is sequential, so you can do something as simple as

breaks = np.flatnonzero(np.diff(data[:, 1])) + 1   # diff[i] != 0 means a new value starts at i + 1
start = np.concatenate(([0], breaks))
end = np.concatenate((breaks, [data.shape[0]]))
means = np.add.reduceat(data[:, 0], start) / (end - start)

In the sorted data, a non-zero diff in y marks the position just before a new value of y begins. You can use that to compute the start and end indices of each segment of x that shares the same y value. np.add.reduceat sums x from each start index up to the next one, and dividing by the segment lengths (end - start) turns those sums into means.
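The boundary arithmetic is easier to follow on a tiny input. A minimal sketch with made-up arrays, using np.argsort in place of lexsort on the assumption that only y needs to be ordered; the intermediate values are shown in the comments:

```python
import numpy as np

x_small = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 5.0])  # made-up data
y_small = np.array([0.2, 0.1, 0.2, 0.3, 0.1, 0.2])  # made-up categories

order = np.argsort(y_small, kind="stable")  # indices that sort y
xs, ys = x_small[order], y_small[order]
# ys = [0.1, 0.1, 0.2, 0.2, 0.2, 0.3]
# xs = [2.0, 6.0, 1.0, 3.0, 5.0, 4.0]

breaks = np.flatnonzero(np.diff(ys)) + 1    # [2, 5]: first index of each new y value
start = np.concatenate(([0], breaks))       # [0, 2, 5]
end = np.concatenate((breaks, [len(ys)]))   # [2, 5, 6]

sums = np.add.reduceat(xs, start)           # [8.0, 9.0, 4.0]
means = sums / (end - start)                # [4.0, 3.0, 4.0]
print(dict(zip(ys[start].tolist(), means.tolist())))
```

Here ys[start] recovers the y value that each mean belongs to, since start points at the first element of every segment.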
