I've got a numpy array with machine learning data, with over 500000 rows.
It looks like this:
[[1,2,3,4,1,0.3], [1,3,2,4,0,0.9], [3,2,5,4,0,0.8] ...]
The first 4 values are parameters, the fifth is the class and the sixth is the probability of class 0.
The problem is that the data is strongly imbalanced: there are over 20 times more rows with class 0 than with class 1. This is bad for learning, so I need to remove many rows with class 0. For best results, though, I don't want to remove rows at random, but like this:
I need to remove the rows with the highest value at index 5 (probability of class 0), in a loop, until there is the same number of rows with 0 and with 1 at index 4 (class).
If there is a better solution than a loop, that would be fantastic.
This is a little bit complicated, so if you have more questions, feel free to ask.
Both classes can be given the same number of rows by removing the rows with the highest probability from the majority class, as follows.
Call your matrix D; the result is R.
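import numpy as np
# np.unique returns counts for the sorted class values (0, 1); class 0 is assumed to be the majority
_, (max_count, min_count) = np.unique(D[:, 4], return_counts=True)
sort_cols = D[:, 4:]
# np.lexsort uses the last key as the primary one, so flip the keys to (probability, class)
flipped_cols = np.flip(sort_cols.T, axis=0)
S = D[np.lexsort(flipped_cols)]  # sorted by class, then by probability (ascending)
S[:max_count, :] = np.flip(S[:max_count, :], axis=0)  # class 0 rows: highest probability first
R = S[max_count - min_count:, :]  # drop the surplus class 0 rows so both classes have min_count rows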
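As a quick check of the idea, here is a minimal sketch on a made-up six-row array. The data and the names n_zero and n_one are illustrative assumptions, and class 0 is assumed to be the majority. It passes the keys to np.lexsort directly, relying on the fact that the last key is the primary sort key.

```python
import numpy as np

# Toy data: columns 0-3 are parameters, column 4 is the class,
# column 5 is the probability of class 0.
D = np.array([
    [1, 2, 3, 4, 1, 0.3],
    [1, 3, 2, 4, 0, 0.9],
    [3, 2, 5, 4, 0, 0.8],
    [2, 2, 1, 4, 0, 0.6],
    [5, 1, 1, 4, 1, 0.2],
    [4, 4, 2, 4, 0, 0.7],
])

# Sort by class first, then by probability (the last key is the primary one).
order = np.lexsort((D[:, 5], D[:, 4]))
S = D[order]

n_zero = np.count_nonzero(S[:, 4] == 0)
n_one = np.count_nonzero(S[:, 4] == 1)

# Class 0 rows occupy S[:n_zero] in ascending probability order, so the
# first n_one of them are the ones to keep; S[n_zero:] holds the class 1 rows.
R = np.concatenate((S[:n_one], S[n_zero:]))
print(R)
```

Here the two class 0 rows with probabilities 0.9 and 0.8 are dropped, leaving two rows of each class.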
Assuming your array is called arr and that arr[:, 4] == (arr[:, 5] < t).astype(int), where t is some threshold value (probably 0.5):
n = np.count_nonzero(arr[:, 4]) # number of ones, as an integer
i = np.argpartition(arr[:, 5], 2 * n)[:2 * n] # indices of the bottom 2n p values
out = arr[i] # or `arr[np.sort(i)]` to maintain original order
Otherwise:
nz = np.flatnonzero(arr[:, 4]) # integer indices of `1` rows
z = np.flatnonzero(arr[:, 4] == 0) # integer indices of `0` rows
n = nz.size # same as above
i = np.argpartition(arr[z, 5], n)[:n] # positions in z of the bottom n p values among `0` rows
j = np.sort(np.r_[z[i], nz]) # combine `1` indices and bottom n `0` indices
out = arr[j] # output
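The general case can be wrapped in a small function. This is only a sketch: the function name undersample_class0, its column parameters, and the random demo data are assumptions made for illustration, not part of the question.

```python
import numpy as np

def undersample_class0(arr, class_col=4, prob_col=5):
    """Keep all class 1 rows plus equally many class 0 rows with the lowest prob_col."""
    ones = np.flatnonzero(arr[:, class_col] == 1)
    zeros = np.flatnonzero(arr[:, class_col] == 0)
    n = ones.size
    if n >= zeros.size:
        # Class 0 is not the majority, so there is nothing to remove.
        return arr.copy()
    # argpartition puts the n smallest values in the first n slots (unordered).
    lowest = np.argpartition(arr[zeros, prob_col], n)[:n]
    # Sorting the combined indices preserves the original row order.
    keep = np.sort(np.concatenate((zeros[lowest], ones)))
    return arr[keep]

# Made-up demo data: 1000 rows, roughly 5% of them in class 1.
rng = np.random.default_rng(0)
demo = np.column_stack((
    rng.integers(1, 6, size=(1000, 4)),     # four parameter columns
    (rng.random(1000) < 0.05).astype(int),  # class column
    rng.random(1000),                       # stand-in probability column
))
balanced = undersample_class0(demo)
print(np.unique(balanced[:, 4], return_counts=True))
```

argpartition avoids a full sort, which may matter with 500000 rows; if the kept rows also need to be ordered by probability, a plain argsort is the simpler choice.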
Let's generate some fake data
In [84]: import numpy as np
In [85]: from random import randint, random
In [86]: data = [[1,2,3,4, randint(0,2), random()] for _ in range(20)]
and change all class 2 rows to be class 0 rows, so we have (probably) a preponderance of zeros.
In [87]: for row in data: row[4] = 0 if row[4]==2 else row[4]
In your example you used a structured array, so I use a structured array as well. To make a structured array we need a list of tuples, not a list of lists.
In [88]: data=[tuple(r) for r in data]
In [89]: dtype = [('a', int), ('b', int), ('c', int), ('d', int), ('class', int), ('p', float)]
In [90]: a = np.array(data, dtype=dtype)
In [91]: a
Out[91]:
array([(1, 2, 3, 4, 0, 0.92339399), (1, 2, 3, 4, 0, 0.04958431),
(1, 2, 3, 4, 0, 0.83051072), (1, 2, 3, 4, 1, 0.3753248 ),
(1, 2, 3, 4, 0, 0.44558775), (1, 2, 3, 4, 0, 0.49603591),
(1, 2, 3, 4, 0, 0.86809067), (1, 2, 3, 4, 0, 0.4207889 ),
(1, 2, 3, 4, 0, 0.79489487), (1, 2, 3, 4, 0, 0.60212444),
(1, 2, 3, 4, 0, 0.115112 ), (1, 2, 3, 4, 0, 0.61500626),
(1, 2, 3, 4, 0, 0.42648162), (1, 2, 3, 4, 0, 0.49199412),
(1, 2, 3, 4, 0, 0.37444409), (1, 2, 3, 4, 1, 0.8406318 ),
(1, 2, 3, 4, 0, 0.92859289), (1, 2, 3, 4, 0, 0.1409527 ),
(1, 2, 3, 4, 0, 0.82438293), (1, 2, 3, 4, 0, 0.95475589)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8'), ('d', '<i8'), ('class', '<i8'), ('p', '<f8')])
We can sort a structured array according to a sequence of its fields
In [93]: a = np.sort(a, order=('class','p',))
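To see what sorting by a sequence of fields does, here is a small sketch with a made-up four-record array (the values are illustrative only). Records are ordered by 'class' first, and ties are broken by 'p'.

```python
import numpy as np

dtype = [('class', int), ('p', float)]
rows = np.array([(0, 0.9), (1, 0.4), (0, 0.2), (1, 0.1)], dtype=dtype)

# Sort by 'class' first, then by 'p' within each class.
ordered = np.sort(rows, order=['class', 'p'])

print(ordered['class'])  # [0 0 1 1]
print(ordered['p'])      # [0.2 0.9 0.1 0.4]
```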
Select the records with class 1 and count them
In [94]: b = a[a['class']==1]
In [95]: lb = len(b)
Concatenate the first lb class 0 records (after sorting, the ones with the lowest p) and b
In [100]: np.concatenate((a[a['class']==0][:lb], b))
Out[100]:
array([(1, 2, 3, 4, 0, 0.04958431), (1, 2, 3, 4, 0, 0.115112 ),
(1, 2, 3, 4, 0, 0.1409527 ), (1, 2, 3, 4, 0, 0.37444409),
(1, 2, 3, 4, 0, 0.4207889 ), (1, 2, 3, 4, 0, 0.42648162),
(1, 2, 3, 4, 1, 0.15497822), (1, 2, 3, 4, 1, 0.16193617),
(1, 2, 3, 4, 1, 0.25970286), (1, 2, 3, 4, 1, 0.29034866),
(1, 2, 3, 4, 1, 0.40348877), (1, 2, 3, 4, 1, 0.75604181)],
dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8'), ('d', '<i8'), ('class', '<i8'), ('p', '<f8')])
You can check that the output of the last expression is exactly what you asked for.
PS or at least it's what I think you've asked for...
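One way to run that check programmatically is sketched below. The seeded random data and the names result, kept and dropped are assumptions for illustration; the sort-and-concatenate steps are the same as above.

```python
import numpy as np

rng = np.random.default_rng(42)
dtype = [('a', int), ('b', int), ('c', int), ('d', int), ('class', int), ('p', float)]

# Made-up data: 200 records, roughly 10% of them in class 1.
a = np.zeros(200, dtype=dtype)
a['class'] = (rng.random(200) < 0.1).astype(int)
a['p'] = rng.random(200)

a_sorted = np.sort(a, order=['class', 'p'])
ones = a_sorted[a_sorted['class'] == 1]
zeros = a_sorted[a_sorted['class'] == 0]
result = np.concatenate((zeros[:len(ones)], ones))

kept = result[result['class'] == 0]
dropped = zeros[len(ones):]

# Both classes should now have the same number of records.
print(len(kept) == len(ones))
# Every dropped class 0 record should have a p at least as high as any kept one.
print(kept['p'].max() <= dropped['p'].min())
```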