I'm not quite sure how to formulate this question. I'm almost certain this will have been asked before, but I can't find it.
I have some data, like:
import numpy as np
x = np.random.rand(100) * 0.0001
y = [round(i, 1) for i in np.random.rand(100)]
They are both 100 elements long. However, y contains only about 10 unique elements. For each unique element in y, I want to take the average of all the numbers in x at the same positions.
Something like:
averageX = []
for unique in set(y):
    items = []
    for i, value in enumerate(y):
        if value == unique:  # For each copy of this number
            items.append(x[i])  # take the item in x at that index
    averageX.append(np.mean(items))  # and take the average
What would be the best pythonic way to do this?
So... x is some data, y is a category map of sorts mapping each index of x to a category, and you need per-category averages?
import collections
import random
x = [random.randint(0, 100) for i in range(100)] # data
y = [random.randint(0, 10) for i in range(100)] # categories
data_per_category = collections.defaultdict(list)
for category, datum in zip(y, x):  # iterate in parallel over both y and x
    data_per_category[category].append(datum)
for category, data in data_per_category.items():
    print(category, sum(data) / len(data))
This prints out, for example:
9 51.2
5 49.0
8 56.75
1 48.166666666666664
7 45.0
0 38.42857142857143
3 50.333333333333336
4 43.7
6 45.4
10 53.0
2 44.583333333333336
If you convert to pandas, you can take advantage of groupby:
import numpy as np
import pandas as pd
x = np.random.rand(100) * 0.0001
y = [round(i, 1) for i in np.random.rand(100)]
df = pd.DataFrame({'x': x, 'y': y})
df.groupby('y').mean()
#Output:
#0.0 0.000019
#0.1 0.000046
#0.2 0.000051
#0.3 0.000049
#0.4 0.000031
#0.5 0.000043
#0.6 0.000051
#0.7 0.000049
#0.8 0.000044
#0.9 0.000053
#1.0 0.000034
I'm not sure about the efficiency, but you can use masking:
y = np.asarray(y)  # the comparison below needs an array, not a list
means = {}
for i in np.unique(y):
    means[i] = x[y == i].mean()
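If the Python-level loop over categories turns out to be the slow part, one possible alternative (not from the answers above, just a sketch) is to let np.unique label each element with its category index and have np.bincount do the per-category sums and counts. The seeded generator and the variable names here are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
x = rng.random(100) * 0.0001
y = np.round(rng.random(100), 1)

# inverse[k] is the index into `categories` of y[k]
categories, inverse = np.unique(y, return_inverse=True)

sums = np.bincount(inverse, weights=x)  # per-category sum of x
counts = np.bincount(inverse)           # per-category number of items
means = dict(zip(categories.tolist(), (sums / counts).tolist()))
```

This should give the same dictionary as the masking loop, with a single pass for the sums and one for the counts instead of one mask per category.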
Another way, probably somewhat more efficient, is sorting:
data = np.stack((x, y), axis=-1)  # shape (100, 2): column 0 is x, column 1 is y
data = data[np.lexsort(data.T), :]  # sort rows by y (the last key), then by x
Now the split is sequential, so you can do something as simple as
breaks = np.flatnonzero(np.diff(data[:, 1])) + 1  # a diff at k means a new segment starts at k + 1
start = np.concatenate(([0], breaks))
end = np.concatenate((breaks, [data.shape[0]]))
means = np.add.reduceat(data[:, 0], start) / (end - start)
In the sorted data, a non-zero diff in y indicates a new value of y. You can use that to compute the indices of the start and end of each segment in x that has the same y value. The sums of the segments are given by reduceat between the start indices.
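To make the segment logic easier to check by hand, here is a minimal sketch of the same idea on six made-up values. It assumes a plain np.argsort on y instead of lexsort on the stacked array, since the order of x inside each group does not change the mean; the data values are illustrative only.

```python
import numpy as np

# Small hand-checkable data (illustrative values, not from the question)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.2, 0.1, 0.2, 0.3, 0.1, 0.2])

order = np.argsort(y, kind="stable")  # positions grouped by y value
xs, ys = x[order], y[order]

# A non-zero diff at position k means a new y value starts at k + 1
breaks = np.flatnonzero(np.diff(ys)) + 1
start = np.concatenate(([0], breaks))
end = np.concatenate((breaks, [len(ys)]))

means = np.add.reduceat(xs, start) / (end - start)
categories = ys[start]  # the y value that each mean belongs to

for c, m in zip(categories, means):
    print(c, m)
```

Here the sorted x is [2, 5, 1, 3, 6, 4], the segments start at 0, 2 and 5, and the segment sums 7, 10 and 4 are divided by the lengths 2, 3 and 1.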