Python: filter a chunk of data in a 2D array

Question

import numpy as np

data = np.array([
    [20,  0,  5,  1],
    [20,  0,  5,  1],
    [20,  0,  5,  0],
    [20,  1,  5,  0],
    [20,  1,  5,  0],
    [20,  2,  5,  1],
    [20,  3,  5,  0],
    [20,  3,  5,  0],
    [20,  3,  5,  1],
    [20,  4,  5,  0],
    [20,  4,  5,  0],
    [20,  4,  5,  0]
])

I have the 2D array above. Let's call the columns a, b, c, d, in that order, where column b acts as an id. For each group of rows that share the same value in column b (the same id), I want to delete the whole group if the number 1 does not appear at least once in column d within that group. After filtering, I expect the following result:

[[20  0  5  1]
 [20  0  5  1]
 [20  0  5  0]
 [20  2  5  1]
 [20  3  5  0]
 [20  3  5  0]
 [20  3  5  1]]

All rows with b = 1 and b = 4 have been deleted from the data.

To sum up, because I see answers that don't fit: we look at chunks of data grouped by the b column. If a complete chunk does not have even one occurrence of the number 1 in column d, we delete all the rows of that b value. In the example above, the chunks with b = 1 and b = 4 ("id" = 1 and "id" = 4) have 0 occurrences of the number 1 in column d. That's why they get deleted from the data.

Answer 1

Generic approach: Here's an approach using np.unique and np.bincount to solve the generic case -

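# tags maps every row to the index of its id within the sorted unique ids
unq,tags = np.unique(data[:,1],return_inverse=True)
# count the rows with d == 1 per id and keep the ids that have at least one
goodIDs = np.flatnonzero(np.bincount(tags,data[:,3]==1)>=1)
out = data[np.in1d(tags,goodIDs)]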

Sample run -

In [15]: data
Out[15]: 
array([[20, 10,  5,  1],
       [20, 73,  5,  0],
       [20, 73,  5,  1],
       [20, 31,  5,  0],
       [20, 10,  5,  1],
       [20, 10,  5,  0],
       [20, 42,  5,  1],
       [20, 54,  5,  0],
       [20, 73,  5,  0],
       [20, 54,  5,  0],
       [20, 54,  5,  0],
       [20, 31,  5,  0]])

In [16]: out
Out[16]: 
array([[20, 10,  5,  1],
       [20, 73,  5,  0],
       [20, 73,  5,  1],
       [20, 10,  5,  1],
       [20, 10,  5,  0],
       [20, 42,  5,  1],
       [20, 73,  5,  0]])
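To make the intermediate values easier to follow, here is a minimal sketch that re-runs the generic approach on the sample above and exposes each step. One deviation from the answer, stated plainly: instead of np.in1d it indexes the per-id counts with tags, which should select the same rows. The variable names counts and keep are illustrative and not part of the original answer.

```python
import numpy as np

data = np.array([
    [20, 10, 5, 1], [20, 73, 5, 0], [20, 73, 5, 1], [20, 31, 5, 0],
    [20, 10, 5, 1], [20, 10, 5, 0], [20, 42, 5, 1], [20, 54, 5, 0],
    [20, 73, 5, 0], [20, 54, 5, 0], [20, 54, 5, 0], [20, 31, 5, 0],
])

# unq holds the sorted unique ids; tags[i] is the position of row i's id in unq
unq, tags = np.unique(data[:, 1], return_inverse=True)

# Per-id number of rows whose column d equals 1 (the boolean weights are summed)
counts = np.bincount(tags, weights=(data[:, 3] == 1))

# Broadcast the per-id verdict back to every row and filter
keep = counts[tags] >= 1
out = data[keep]

print(unq)     # unique ids
print(counts)  # ones per id, aligned with unq
print(out)
```

Because the ids go through np.unique first, this variant does not need the ids to be small, sorted, or contiguous.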

Specific-case approach: If the second column is always sorted and holds sequential numbers starting from 0, we can use a simplified version, like so -

goodIDs = np.flatnonzero(np.bincount(data[:,1],data[:,3]==1)>=1)
out = data[np.in1d(data[:,1],goodIDs)]

Sample run -

In [44]: data
Out[44]: 
array([[20,  0,  5,  1],
       [20,  0,  5,  1],
       [20,  0,  5,  0],
       [20,  1,  5,  0],
       [20,  1,  5,  0],
       [20,  2,  5,  1],
       [20,  3,  5,  0],
       [20,  3,  5,  0],
       [20,  3,  5,  1],
       [20,  4,  5,  0],
       [20,  4,  5,  0],
       [20,  4,  5,  0]])

In [45]: out
Out[45]: 
array([[20,  0,  5,  1],
       [20,  0,  5,  1],
       [20,  0,  5,  0],
       [20,  2,  5,  1],
       [20,  3,  5,  0],
       [20,  3,  5,  0],
       [20,  3,  5,  1]])

Also, if data[:,3] always holds only ones and zeros, we can just use data[:,3] in place of data[:,3]==1 in the code listed above.
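A small sketch of why that shortcut depends on the column being strictly 0/1. The arrays ids, d_binary and d_other are made-up illustrations, not data from the question.

```python
import numpy as np

ids = np.array([0, 0, 1, 1])
d_binary = np.array([0, 1, 0, 0])   # only zeros and ones
d_other = np.array([0, 2, 0, 0])    # hypothetical column containing a value other than 0/1

# With 0/1 data, the raw column and the boolean comparison give the same per-id sums
same_for_binary = np.array_equal(
    np.bincount(ids, d_binary), np.bincount(ids, d_binary == 1)
)

# With other values, the raw column sums to a positive number for id 0
# even though no row of that id actually contains a 1
raw_verdict = np.bincount(ids, d_other) >= 1
strict_verdict = np.bincount(ids, d_other == 1) >= 1

print(same_for_binary)
print(raw_verdict, strict_verdict)
```

So the explicit ==1 comparison is the safer default when the contents of the column are not guaranteed.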

Benchmarking

Let's benchmark the vectorized approaches on the specific case for a larger array -

In [69]: def logical_or_based(data): #@ Eric's soln
    ...:     b_vals = data[:,1]
    ...:     d_vals = data[:,3]
    ...:     is_ok = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
    ...:     np.logical_or.at(is_ok, b_vals, d_vals)
    ...:     return is_ok[b_vals]
    ...: 
    ...: def in1d_based(data):
    ...:     goodIDs = np.flatnonzero(np.bincount(data[:,1],data[:,3])!=0)
    ...:     out = np.in1d(data[:,1],goodIDs)
    ...:     return out
    ...: 

In [70]: # Setup input
    ...: data = np.random.randint(0,100,(10000,4))
    ...: data[:,1] = np.sort(np.random.randint(0,100,(10000)))
    ...: data[:,3] = np.random.randint(0,2,(10000))
    ...: 

In [71]: %timeit logical_or_based(data) #@ Eric's soln
1000 loops, best of 3: 1.44 ms per loop

In [72]: %timeit in1d_based(data)
1000 loops, best of 3: 528 µs per loop
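For readers without IPython, here is a sketch of the same comparison as a plain script using the standard timeit module. It is an adaptation, not the original benchmark: it uses np.isin where the answer uses np.in1d, passes an explicit boolean to np.logical_or.at, and seeds the generator so runs are repeatable. The name bincount_based is illustrative. Timings depend on machine and NumPy version, so none are shown.

```python
import timeit
import numpy as np

def logical_or_based(data):
    b_vals = data[:, 1]
    d_vals = data[:, 3] == 1
    is_ok = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
    np.logical_or.at(is_ok, b_vals, d_vals)
    return is_ok[b_vals]

def bincount_based(data):
    good_ids = np.flatnonzero(np.bincount(data[:, 1], data[:, 3]) != 0)
    return np.isin(data[:, 1], good_ids)

# Same shape of input as the session above: sorted ids in 0..99, 0/1 flags
rng = np.random.default_rng(0)
data = rng.integers(0, 100, (10000, 4))
data[:, 1] = np.sort(rng.integers(0, 100, 10000))
data[:, 3] = rng.integers(0, 2, 10000)

# Check that both functions agree before comparing their speed
masks_agree = np.array_equal(logical_or_based(data), bincount_based(data))
print("masks agree:", masks_agree)

for fn in (logical_or_based, bincount_based):
    seconds = timeit.timeit(lambda: fn(data), number=200) / 200
    print(f"{fn.__name__}: {seconds * 1e6:.0f} microseconds per call")
```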

Answer 2

Code:

import numpy as np

my_list = [[20,0,5,1],
    [20,0,5,1],
    [20,0,5,0],
    [20,1,5,0],
    [20,1,5,0],
    [20,2,5,1],
    [20,3,5,0],
    [20,3,5,0],
    [20,3,5,1],
    [20,4,5,0],
    [20,4,5,0],
    [20,4,5,0]]

all_ids = np.array(my_list)[:,1]
unique_ids = np.unique(all_ids)
# first row index of each id (assumes rows with the same id are contiguous and sorted by id)
indices = [np.where(all_ids==ui)[0][0] for ui in unique_ids]
# sentinel end index so the last group needs no special case
indices.append(len(my_list))

final = []
for k in range(len(unique_ids)):
    tmp_group = my_list[indices[k]:indices[k+1]]
    if 1 in np.array(tmp_group)[:,3]:
        final.extend(tmp_group)

print(np.array(final))

Result:

[[20  0  5  1]
 [20  0  5  1]
 [20  0  5  0]
 [20  2  5  1]
 [20  3  5  0]
 [20  3  5  0]
 [20  3  5  1]]
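The same chunk-by-chunk idea can also be sketched with itertools.groupby from the standard library, which yields consecutive runs of rows sharing a key. This is an alternative formulation rather than the answer's own code, and it carries the same assumption that rows of one id are contiguous. The names rows and kept are illustrative.

```python
from itertools import groupby

rows = [[20, 0, 5, 1], [20, 0, 5, 1], [20, 0, 5, 0],
        [20, 1, 5, 0], [20, 1, 5, 0],
        [20, 2, 5, 1],
        [20, 3, 5, 0], [20, 3, 5, 0], [20, 3, 5, 1],
        [20, 4, 5, 0], [20, 4, 5, 0], [20, 4, 5, 0]]

kept = []
# groupby yields one group per consecutive run of equal ids (column b)
for _, group in groupby(rows, key=lambda row: row[1]):
    chunk = list(group)
    # keep the whole chunk if any of its rows has d == 1
    if any(row[3] == 1 for row in chunk):
        kept.extend(chunk)

print(kept)
```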

Answer 3

This gets rid of all rows with 1 in the second position:

[sublist for sublist in list_ if sublist[1] != 1]

This gets rid of all rows with 1 in the second position unless the fourth position is also 1:

[sublist for sublist in list_ if not (sublist[1] == 1 and sublist[3] != 1) ]
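Both comprehensions above hard-code the id 1 and test each row on its own, whereas the question asks for a decision per group of rows. A sketch of a list-based variant that makes that decision, assuming list_ holds the rows from the question; good_ids and filtered are illustrative names:

```python
list_ = [[20, 0, 5, 1], [20, 0, 5, 1], [20, 0, 5, 0],
         [20, 1, 5, 0], [20, 1, 5, 0],
         [20, 2, 5, 1],
         [20, 3, 5, 0], [20, 3, 5, 0], [20, 3, 5, 1],
         [20, 4, 5, 0], [20, 4, 5, 0], [20, 4, 5, 0]]

# First pass: collect every id that has at least one row with 1 in the fourth position
good_ids = {sublist[1] for sublist in list_ if sublist[3] == 1}

# Second pass: keep all rows whose id was collected
filtered = [sublist for sublist in list_ if sublist[1] in good_ids]

print(filtered)
```

Unlike the slicing approach, this does not require the rows of one id to be adjacent.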

Answer 4

Let's assume the following:

b >= 0
b is an integer
b is fairly dense, i.e. max(b) ~= len(unique(b))

Here's a solution using np.ufunc.at:

# unpack for clarity - this costs nothing in numpy
b_vals = data[:,1]
d_vals = data[:,3]

# build an array indexed by b values
is_ok = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
np.logical_or.at(is_ok, b_vals, d_vals)
# is_ok == array([ True, False,  True,  True, False], dtype=bool)

# take the rows which have a b value that was deemed OK
result = data[is_ok[b_vals]]

np.logical_or.at(is_ok, b_vals, d_vals) is a more efficient version of:

for idx, val in zip(b_vals, d_vals):
    is_ok[idx] = np.logical_or(is_ok[idx], val)
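A short sketch of why the unbuffered .at form is used instead of the more obvious in-place fancy-indexed update. With repeated indices, the fancy-indexed form computes all results from the original values first and then writes them back, so a True from an early row can be overwritten by a later row of the same id. The arrays here are made up for illustration.

```python
import numpy as np

b_vals = np.array([0, 0, 0, 1, 1])
d_vals = np.array([True, False, False, False, False])

# Buffered update: every write for index 0 is computed from the initial False,
# and which write survives is not something to rely on
buffered = np.zeros(2, dtype=bool)
buffered[b_vals] |= d_vals

# Unbuffered update: each (index, value) pair is applied in turn and accumulates
unbuffered = np.zeros(2, dtype=bool)
np.logical_or.at(unbuffered, b_vals, d_vals)

# Explicit loop for comparison, as in the answer
looped = np.zeros(2, dtype=bool)
for idx, val in zip(b_vals, d_vals):
    looped[idx] = np.logical_or(looped[idx], val)

print(buffered, unbuffered, looped)
```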

Answer 5

Untested since I'm in a hurry, but this should work:

import numpy_indexed as npi
g = npi.group_by(data[:, 1])
ids, valid = g.any(data[:, 3])
result = data[valid[g.inverse]]
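If the third-party package is not available, the same group-then-any idea can be sketched in plain NumPy with a stable sort and np.logical_or.reduceat. This is not how numpy_indexed is implemented, only an illustration of the technique; the function filter_groups_any and its parameters are hypothetical names.

```python
import numpy as np

def filter_groups_any(data, key_col=1, flag_col=3):
    """Keep rows whose key group has at least one row with flag == 1."""
    if len(data) == 0:
        return data
    keys = data[:, key_col]
    flags = data[:, flag_col] == 1

    # Sort rows by key so that each group becomes one contiguous run
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]

    # True at the first row of every run of equal keys
    is_start = np.r_[True, sorted_keys[1:] != sorted_keys[:-1]]
    starts = np.flatnonzero(is_start)

    # Logical OR of the flags within each run
    group_any = np.logical_or.reduceat(flags[order], starts)

    # Group number of each sorted row, then scatter the verdict back to input order
    group_of_sorted = np.cumsum(is_start) - 1
    keep = np.empty(len(data), dtype=bool)
    keep[order] = group_any[group_of_sorted]
    return data[keep]

data = np.array([
    [20, 0, 5, 1], [20, 0, 5, 1], [20, 0, 5, 0], [20, 1, 5, 0],
    [20, 1, 5, 0], [20, 2, 5, 1], [20, 3, 5, 0], [20, 3, 5, 0],
    [20, 3, 5, 1], [20, 4, 5, 0], [20, 4, 5, 0], [20, 4, 5, 0],
])
result = filter_groups_any(data)
print(result)
```

The rows come back in their original order, and the ids need not be sorted, dense, or non-negative.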

5 answers

Answer 1: 3, accepted, 2016-10-19 14:32:19
Answer 2: 1, 2016-10-19 13:14:34
Answer 3: 1, 2016-10-19 13:16:29
Answer 4: 1, 2016-10-19 14:37:34
Answer 5: 1, 2016-10-19 16:09:35
