
Numpy array normalization by group ids:

Suppose data and labels are numpy arrays as below:

import numpy as np
data=np.array([[0,4,5,6,8],[0,6,8,9],[1,9,5],[1,45,7],[1,8,3]], dtype=object) # Note: rows differ in length, so recent NumPy versions require dtype=object
labels=np.array([4,6,10,4,6])

The first element of each row in data is a group id. I want to normalize the labels within each group, that is, divide each label by the sum of the labels that share its group id.

For example, the first two rows in data have id=0, so their normalized labels must be:

normalized_labels[0]=labels[0]/(4+6)=0.4 
normalized_labels[1]=labels[1]/(4+6)=0.6

The expected output should be:

normalized_labels=[0.4,0.6,0.5,0.2,0.3]   

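My naive solution is:

ids=np.array([row[0] for row in data])  # group id is the first element of each row
out=np.empty(len(labels), dtype=float)
for i in np.unique(ids):
    mask=(ids==i)  # elementwise comparison needs ids to be an array, not a list
    out[mask]=labels[mask]/np.sum(labels[mask])  # write back in the original row order
print(out)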

Is there a numpy function that performs this task without the Python loop? Any suggestion is appreciated!

One somewhat subtle way is to map each label to the sum of its group, using the group ids indices = [n[0] for n in data]. Once indices has been extracted, data is no longer needed:

indices = [n[0] for n in data]
u, inv = np.unique(indices, return_inverse=True)
bincnt = np.bincount(inv, weights=labels)
sums = bincnt[inv]

Now sums is array([10., 10., 20., 20., 20.]). The rest is simple:

normalized_labels = labels / sums
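
The steps above can be collected into one self-contained sketch. The function name normalize_by_group and the variable group_ids are illustrative choices, not part of the original answer, and the sketch assumes every group has a non-zero label sum:

```python
import numpy as np

def normalize_by_group(group_ids, labels):
    """Divide each label by the sum of the labels sharing its group id."""
    group_ids = np.asarray(group_ids)
    labels = np.asarray(labels, dtype=float)
    # inv[k] is the position of group_ids[k] among the sorted unique ids
    _, inv = np.unique(group_ids, return_inverse=True)
    # one weighted sum per unique group, in sorted-id order
    group_sums = np.bincount(inv, weights=labels)
    # index with inv to map each group's sum back onto its members
    return labels / group_sums[inv]

data = [[0, 4, 5, 6, 8], [0, 6, 8, 9], [1, 9, 5], [1, 45, 7], [1, 8, 3]]
labels = [4, 6, 10, 4, 6]
group_ids = [row[0] for row in data]  # plain lists avoid the ragged-array issue

result = normalize_by_group(group_ids, labels)
print(result)  # [0.4 0.6 0.5 0.2 0.3]
```

The result keeps the original row order, so it also works when rows of the same group are not adjacent. A group whose labels sum to zero would produce a division-by-zero warning and inf or nan entries.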

Remarks. np.bincount computes weighted sums for items labeled 0, 1, 2, ..., which is why the re-indexing indices -> inv is needed. For example, indices = [8, 6, 4, 3, 4, 6, 8, 8] is mapped to inv = [3, 2, 1, 0, 1, 2, 3, 3].
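
To make the re-indexing in the remark concrete, here is a small sketch using the same indices. The all-ones weights array is a hypothetical choice for illustration, so each group's sum equals its size:

```python
import numpy as np

indices = np.array([8, 6, 4, 3, 4, 6, 8, 8])
weights = np.ones(len(indices))  # hypothetical weights: each item counts as 1

u, inv = np.unique(indices, return_inverse=True)
print(u)    # [3 4 6 8]           sorted unique ids
print(inv)  # [3 2 1 0 1 2 3 3]   position of each id within u

sums = np.bincount(inv, weights=weights)
print(sums)       # [1. 2. 2. 3.]               one total per unique id
print(sums[inv])  # [3. 2. 2. 1. 2. 2. 3. 3.]   totals in the original order
```

Calling np.bincount on the raw indices would also run here, but it would return an array of length 9 with zeros for the unused ids 0, 1, 2, 5 and 7, and it would fail outright for negative or non-integer ids. Going through inv avoids both problems.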
