
Remove elements from 2D numpy array based on specific value

I've got a NumPy array of machine learning data with over 500,000 rows.

It looks like this:

[[1,2,3,4,1,0.3], [1,3,2,4,0,0.9], [3,2,5,4,0,0.8] ...]

The first 4 values are parameters, fifth is a class and sixth is probability for class 0.

The problem is that the data is heavily imbalanced: there are over 20 times more rows with class 0 than with class 1. This is bad for learning, so I need to remove many class-0 rows. But for best results I don't want to remove data at random; instead:

I need to repeatedly remove the rows with the highest value at index 5 (probability of class 0) until there are equally many rows with 0 and 1 at index 4 (class).

If there is a solution better than a loop, that would be fantastic.

This is a little bit complicated, so if you have more questions, feel free to ask.

Making both classes contain the same number of rows, by removing the rows with the highest probability from the majority class, can be done as follows:

Call your matrix D, then your result is R.

_, (max_count, min_count) = np.unique(D[:, 4], return_counts=True)
sort_cols = D[:, 4:]
flipped_cols = np.flip(sort_cols.T, axis=0)
S = D[np.lexsort(flipped_cols)]
S[:max_count, :] = np.flip(S[:max_count, :], axis=0)
R = S[max_count - min_count:, :]

Explanation

  1. Get the majority and minority class sample counts
    • This line relies on the assumption that the majority class is labeled 0 and the minority class 1 (np.unique returns counts in sorted label order, so the count for label 0 comes first). Adjust this line to your needs.
  2. Get the columns to be used during sorting.
  3. flipped_cols will be used in np.lexsort.
  4. This is the important bit. This line sorts your data first with respect to the label column, and then with respect to the probability column. In the end, what you get is that in the upper part of your matrix you have the majority data, and in the lower part the minority data. These parts themselves are sorted with respect to the probability.
  5. Reverse the majority part rows since we need to remove the rows with the highest probability.
  6. You get min_count many majority rows and all the minority rows. In this way, your result matrix contains equal number of majority and minority samples.
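As a quick sanity check, here is the whole recipe run end-to-end on a small made-up array (the six rows below are invented for illustration; only the column layout matches the question):

```python
import numpy as np

# Made-up sample: 4 feature columns, class label (index 4), P(class 0) (index 5).
# Four class-0 rows vs. two class-1 rows, so two class-0 rows must go.
D = np.array([
    [1, 2, 3, 4, 0, 0.9],
    [1, 3, 2, 4, 0, 0.8],
    [3, 2, 5, 4, 0, 0.4],
    [2, 2, 1, 4, 0, 0.7],
    [1, 1, 3, 4, 1, 0.3],
    [2, 3, 3, 4, 1, 0.2],
])

_, (max_count, min_count) = np.unique(D[:, 4], return_counts=True)
S = D[np.lexsort(np.flip(D[:, 4:].T, axis=0))]        # sort by class, then by p
S[:max_count, :] = np.flip(S[:max_count, :], axis=0)  # majority part: p descending
R = S[max_count - min_count:, :]                      # drop the highest-p class-0 rows

print(R)  # keeps class-0 rows with p = 0.7 and 0.4, plus both class-1 rows
```

Note the final slice: S has max_count + min_count rows, so removing the first max_count - min_count rows (the highest-probability majority rows, after the flip) leaves exactly min_count rows of each class.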


Assuming that arr[:, 4] = (arr[:, 5] < t).astype(int), where t is some threshold value (probably 0.5):

n   = int(np.sum(arr[:, 4]))                    # number of ones
i   = np.argpartition(arr[:, 5], 2 * n)[:2 * n] # indices of the bottom 2n p values
out = arr[i]                                    # or `arr[np.sort(i)]` to maintain original order

Otherwise:

nz  = np.flatnonzero(arr[:, 4])         # integer indices of the `1` rows
z   = np.flatnonzero(arr[:, 4] == 0)    # integer indices of the `0` rows
n   = nz.size                           # same as above
i   = np.argpartition(arr[z, 5], n)[:n] # bottom n p values from the `0` rows
j   = np.sort(np.r_[z[i], nz])          # combine `1` indices and bottom-n `0` indices
out = arr[j]                            # output, in original row order
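To see the general (second) variant in action, here it is on a small made-up array whose labels are not a simple threshold of p (the rows are invented for illustration):

```python
import numpy as np

# Made-up sample: 4 feature columns, class label (index 4), P(class 0) (index 5).
# Note the class-0 row with p = 0.4: the labels are not a threshold of p here,
# so the general variant is the one that applies.
arr = np.array([
    [1, 2, 3, 4, 0, 0.9],
    [1, 3, 2, 4, 0, 0.8],
    [3, 2, 5, 4, 0, 0.4],
    [2, 2, 1, 4, 0, 0.7],
    [1, 1, 3, 4, 1, 0.3],
    [2, 3, 3, 4, 1, 0.2],
])

nz = np.flatnonzero(arr[:, 4])         # integer indices of the class-1 rows
z = np.flatnonzero(arr[:, 4] == 0)     # integer indices of the class-0 rows
n = nz.size                            # minority count
i = np.argpartition(arr[z, 5], n)[:n]  # n lowest p values among class-0 rows
j = np.sort(np.r_[z[i], nz])           # merged indices, original row order
out = arr[j]

print(out)  # class-0 rows with p = 0.4 and 0.7, plus both class-1 rows
```

Because the kept indices are sorted before the final fancy index, the surviving rows appear in their original order, which matters if the data set has any meaningful row ordering.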

Let's generate some fake data

In [84]: import numpy as np
In [85]: from random import randint, random
In [86]: data = [[1,2,3,4, randint(0,2), random()] for _ in range(20)]

and change all class 2 rows to be class 0 rows, so we have (probably) a preponderance of zeros.

In [87]: for row in data: row[4] = 0 if row[4]==2 else row[4]

In your example you used a structured array, so I'll use a structured array as well. To make a structured array we need a list of tuples, not a list of lists:

In [88]: data=[tuple(r) for r in data]
In [89]: dtype = [('a', int), ('b', int), ('c', int), ('d', int), ('class', int), ('p', float)]
In [90]: a = np.array(data, dtype=dtype)
In [91]: a
Out[91]: 
array([(1, 2, 3, 4, 0,  0.92339399), (1, 2, 3, 4, 0,  0.04958431),
       (1, 2, 3, 4, 0,  0.83051072), (1, 2, 3, 4, 1,  0.3753248 ),
       (1, 2, 3, 4, 0,  0.44558775), (1, 2, 3, 4, 0,  0.49603591),
       (1, 2, 3, 4, 0,  0.86809067), (1, 2, 3, 4, 0,  0.4207889 ),
       (1, 2, 3, 4, 0,  0.79489487), (1, 2, 3, 4, 0,  0.60212444),
       (1, 2, 3, 4, 0,  0.115112  ), (1, 2, 3, 4, 0,  0.61500626),
       (1, 2, 3, 4, 0,  0.42648162), (1, 2, 3, 4, 0,  0.49199412),
       (1, 2, 3, 4, 0,  0.37444409), (1, 2, 3, 4, 1,  0.8406318 ),
       (1, 2, 3, 4, 0,  0.92859289), (1, 2, 3, 4, 0,  0.1409527 ),
       (1, 2, 3, 4, 0,  0.82438293), (1, 2, 3, 4, 0,  0.95475589)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8'), ('d', '<i8'), ('class', '<i8'), ('p', '<f8')])

We can sort a structured array according to a sequence of its fields

In [93]: a = np.sort(a, order=('class','p',))

Select the records with class 1, and count how many there are:

In [94]: b = a[a['class']==1]
In [95]: lb = len(b)

Concatenate the first lb class-0 records (the ones with the lowest p, since a is sorted) with b:

In [100]: np.concatenate((a[a['class']==0][:lb], b))
Out[100]: 
array([(1, 2, 3, 4, 0,  0.04958431), (1, 2, 3, 4, 0,  0.115112  ), 
       (1, 2, 3, 4, 0,  0.1409527 ), (1, 2, 3, 4, 0,  0.37444409),
       (1, 2, 3, 4, 0,  0.4207889 ), (1, 2, 3, 4, 0,  0.42648162),
       (1, 2, 3, 4, 1,  0.15497822), (1, 2, 3, 4, 1,  0.16193617),
       (1, 2, 3, 4, 1,  0.25970286), (1, 2, 3, 4, 1,  0.29034866),
       (1, 2, 3, 4, 1,  0.40348877), (1, 2, 3, 4, 1,  0.75604181)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8'), ('d', '<i8'), ('class', '<i8'), ('p', '<f8')])

You can check that the output of the last expression is exactly what you asked for.


PS or at least it's what I think you've asked for...
