Python: filter a 2D array by a chunk of data
import numpy as np
data = np.array([
[20, 0, 5, 1],
[20, 0, 5, 1],
[20, 0, 5, 0],
[20, 1, 5, 0],
[20, 1, 5, 0],
[20, 2, 5, 1],
[20, 3, 5, 0],
[20, 3, 5, 0],
[20, 3, 5, 1],
[20, 4, 5, 0],
[20, 4, 5, 0],
[20, 4, 5, 0]
])
I have the 2D array shown above. Let's call the fields a, b, c, d, in the above order, where column b acts like an id. For each group of rows sharing the same value in column b (same id), I wish to delete the whole group if it doesn't have at least one appearance of the number 1 in column d, so after filtering I will have the following result:
[[20 0 5 1]
[20 0 5 1]
[20 0 5 0]
[20 2 5 1]
[20 3 5 0]
[20 3 5 0]
[20 3 5 1]]
All rows with b = 1 and b = 4 have been deleted from the data.
To sum up, because I see answers that don't fit: we look at chunks of data grouped by the b column. If a complete chunk of data doesn't have even one appearance of the number 1 in column d, we delete all the rows of that b item. In this example, the chunks with b = 1 and b = 4 ("id" = 1 and "id" = 4) have 0 appearances of the number 1 in column d. That's why they get deleted from the data.
Generic approach: Here's an approach using np.unique and np.bincount to solve the generic case:
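# tags[i] is the position of row i's id within the sorted unique ids
unq, tags = np.unique(data[:,1], return_inverse=True)
# tags whose rows contain at least one 1 in column d
goodIDs = np.flatnonzero(np.bincount(tags, data[:,3]==1) >= 1)
out = data[np.in1d(tags, goodIDs)]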
Sample run:
In [15]: data
Out[15]:
array([[20, 10, 5, 1],
[20, 73, 5, 0],
[20, 73, 5, 1],
[20, 31, 5, 0],
[20, 10, 5, 1],
[20, 10, 5, 0],
[20, 42, 5, 1],
[20, 54, 5, 0],
[20, 73, 5, 0],
[20, 54, 5, 0],
[20, 54, 5, 0],
[20, 31, 5, 0]])
In [16]: out
Out[16]:
array([[20, 10, 5, 1],
[20, 73, 5, 0],
[20, 73, 5, 1],
[20, 10, 5, 1],
[20, 10, 5, 0],
[20, 42, 5, 1],
[20, 73, 5, 0]])
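To make the intermediate steps of the generic approach concrete, here is a small sketch on a shortened array with arbitrary, non-consecutive ids. The data values and the names `per_id_hits`, `good_tags` and `keep_mask` are illustrative assumptions and not part of the original answer; `np.isin` is used here as the current spelling of `np.in1d`.

```python
import numpy as np

# Shortened example with arbitrary ids in column b (illustrative values)
data = np.array([
    [20, 10, 5, 1],
    [20, 10, 5, 0],
    [20, 31, 5, 0],
    [20, 31, 5, 0],
    [20, 42, 5, 1],
    [20, 73, 5, 0],
    [20, 73, 5, 1],
])

# unq: sorted unique ids; tags: position of each row's id within unq
unq, tags = np.unique(data[:, 1], return_inverse=True)
# unq  -> [10 31 42 73]
# tags -> [0 0 1 1 2 3 3]

# Weighted bincount: for each tag, sum the boolean "d == 1" flags,
# i.e. count how many rows of that id have a 1 in column d
per_id_hits = np.bincount(tags, weights=(data[:, 3] == 1))
# per_id_hits -> [1. 0. 1. 1.]

good_tags = np.flatnonzero(per_id_hits >= 1)  # tags with at least one hit
keep_mask = np.isin(tags, good_tags)          # rows whose id is "good"
out = data[keep_mask]
print(out)
```

Only the group with id 31 has no 1 in column d, so its two rows are the ones dropped.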
Specific case approach: If the second column is always sorted and holds sequential numbers starting from 0, we can use a simplified version, like so:
goodIDs = np.flatnonzero(np.bincount(data[:,1],data[:,3]==1)>=1)
out = data[np.in1d(data[:,1],goodIDs)]
Sample run:
In [44]: data
Out[44]:
array([[20, 0, 5, 1],
[20, 0, 5, 1],
[20, 0, 5, 0],
[20, 1, 5, 0],
[20, 1, 5, 0],
[20, 2, 5, 1],
[20, 3, 5, 0],
[20, 3, 5, 0],
[20, 3, 5, 1],
[20, 4, 5, 0],
[20, 4, 5, 0],
[20, 4, 5, 0]])
In [45]: out
Out[45]:
array([[20, 0, 5, 1],
[20, 0, 5, 1],
[20, 0, 5, 0],
[20, 2, 5, 1],
[20, 3, 5, 0],
[20, 3, 5, 0],
[20, 3, 5, 1]])
Also, if data[:,3] always holds only ones and zeros, we can just use data[:,3] in place of data[:,3]==1 in the code listed above.
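As a quick check of that shortcut, the sketch below compares the two weightings on a shortened version of the question's data. The names `counts_eq`, `counts_raw` and `same` are illustrative assumptions, and `np.isin` stands in for `np.in1d`.

```python
import numpy as np

data = np.array([
    [20, 0, 5, 1],
    [20, 0, 5, 0],
    [20, 1, 5, 0],
    [20, 2, 5, 1],
    [20, 3, 5, 0],
    [20, 3, 5, 1],
    [20, 4, 5, 0],
])

ids = data[:, 1]  # must be non-negative integers for np.bincount
d = data[:, 3]

# Explicit comparison: counts rows per id where d equals 1
counts_eq = np.bincount(ids, weights=(d == 1))
# Raw column as weights: sums d per id, identical when d holds only 0 and 1
counts_raw = np.bincount(ids, weights=d)

same = np.array_equal(counts_eq, counts_raw)
out = data[np.isin(ids, np.flatnonzero(counts_raw >= 1))]
print(same)
print(out)
```

If column d could hold values other than 0 and 1, the raw sum would no longer count appearances of 1, so the explicit `== 1` comparison would be the safer choice.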
Benchmarking
Let's benchmark the vectorized approaches on the specific case for a larger array:
In [69]: def logical_or_based(data): #@ Eric's soln
    ...:     b_vals = data[:,1]
    ...:     d_vals = data[:,3]
    ...:     is_ok = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
    ...:     np.logical_or.at(is_ok, b_vals, d_vals)
    ...:     return is_ok[b_vals]
    ...:
    ...: def in1d_based(data):
    ...:     goodIDs = np.flatnonzero(np.bincount(data[:,1],data[:,3])!=0)
    ...:     out = np.in1d(data[:,1],goodIDs)
    ...:     return out
    ...:
In [70]: # Setup input
...: data = np.random.randint(0,100,(10000,4))
...: data[:,1] = np.sort(np.random.randint(0,100,(10000)))
...: data[:,3] = np.random.randint(0,2,(10000))
...:
In [71]: %timeit logical_or_based(data) #@ Eric's soln
1000 loops, best of 3: 1.44 ms per loop
In [72]: %timeit in1d_based(data)
1000 loops, best of 3: 528 µs per loop
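For anyone wanting to repeat the comparison outside IPython, here is a rough sketch using the standard `timeit` module. The function `bincount_based` mirrors `in1d_based` above but uses `np.isin`; the seed, repeat counts and the name `masks_agree` are arbitrary assumptions, and the timings will differ by machine and NumPy version.

```python
import timeit
import numpy as np

def logical_or_based(data):
    b_vals = data[:, 1]
    d_vals = data[:, 3]
    is_ok = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
    np.logical_or.at(is_ok, b_vals, d_vals)
    return is_ok[b_vals]

def bincount_based(data):
    good_ids = np.flatnonzero(np.bincount(data[:, 1], data[:, 3]) != 0)
    return np.isin(data[:, 1], good_ids)

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
data = rng.integers(0, 100, size=(10000, 4))
data[:, 1] = np.sort(rng.integers(0, 100, size=10000))
data[:, 3] = rng.integers(0, 2, size=10000)

# Both functions should select exactly the same rows
masks_agree = np.array_equal(logical_or_based(data), bincount_based(data))
print("masks agree:", masks_agree)

for fn in (logical_or_based, bincount_based):
    # best of 3 repeats, 100 calls each; report time per call
    best = min(timeit.repeat(lambda: fn(data), number=100, repeat=3)) / 100
    print(f"{fn.__name__}: {best * 1e6:.0f} µs per call")
```

Checking that the masks agree before timing guards against benchmarking two functions that do not compute the same thing.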
Code:
import numpy as np
my_list = [[20,0,5,1],
[20,0,5,1],
[20,0,5,0],
[20,1,5,0],
[20,1,5,0],
[20,2,5,1],
[20,3,5,0],
[20,3,5,0],
[20,3,5,1],
[20,4,5,0],
[20,4,5,0],
[20,4,5,0]]
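all_ids = np.array(my_list)[:,1]
unique_ids = np.unique(all_ids)
# first row index of each id (assumes rows with the same id are contiguous)
indices = [np.where(all_ids==ui)[0][0] for ui in unique_ids]
indices.append(len(my_list))  # sentinel marking the end of the last group
final = []
for k in range(len(unique_ids)):
    tmp_group = my_list[indices[k]:indices[k+1]]
    # keep the whole group if column d contains at least one 1
    if 1 in np.array(tmp_group)[:,3]:
        final.extend(tmp_group)
print(np.array(final))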
Result:
[[20 0 5 1]
[20 0 5 1]
[20 0 5 0]
[20 2 5 1]
[20 3 5 0]
[20 3 5 0]
[20 3 5 1]]
This gets rid of all rows with 1 in the second position:
[sublist for sublist in list_ if sublist[1] != 1]
This gets rid of all rows with 1 in the second position unless the fourth position is also 1:
[sublist for sublist in list_ if not (sublist[1] == 1 and sublist[3] != 1) ]
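Note that both comprehensions test each row on its own and hardcode the id 1, whereas the question asks for a decision per group. As a sketch of how the same list-based style could handle that, one could first collect the ids that have a 1 in the fourth position and then keep every row whose id is in that set. The sample `list_` and the names `good_ids` and `filtered` are assumptions for illustration.

```python
list_ = [
    [20, 0, 5, 1],
    [20, 0, 5, 0],
    [20, 1, 5, 0],
    [20, 2, 5, 1],
    [20, 3, 5, 0],
    [20, 3, 5, 1],
    [20, 4, 5, 0],
]

# ids (second position) with at least one row holding 1 in the fourth position
good_ids = {row[1] for row in list_ if row[3] == 1}

# keep every row whose id is in that set, preserving the original order
filtered = [row for row in list_ if row[1] in good_ids]
print(filtered)
```

This makes two passes over the list and does not require the rows to be sorted or grouped by id.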
Let's assume the following:
b >= 0
b is an integer
b is fairly dense, i.e. max(b) ~= len(unique(b))
Here's a solution using np.ufunc.at:
# unpack for clarity - this costs nothing in numpy
b_vals = data[:,1]
d_vals = data[:,3]
# build an array indexed by b values
is_ok = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
np.logical_or.at(is_ok, b_vals, d_vals)
# is_ok == array([ True, False, True, True, False], dtype=bool)
# take the rows which have a b value that was deemed OK
result = data[is_ok[b_vals]]
np.logical_or.at(is_ok, b_vals, d_vals) is a more efficient version of:
for idx, val in zip(b_vals, d_vals):
    is_ok[idx] = np.logical_or(is_ok[idx], val)
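The equivalence can be checked directly. The sketch below runs both forms on a shortened version of the question's data and compares the results; the names `is_ok_at` and `is_ok_loop` are illustrative assumptions.

```python
import numpy as np

data = np.array([
    [20, 0, 5, 1],
    [20, 0, 5, 0],
    [20, 1, 5, 0],
    [20, 2, 5, 1],
    [20, 3, 5, 0],
    [20, 3, 5, 1],
    [20, 4, 5, 0],
])
b_vals = data[:, 1]
d_vals = data[:, 3]

# Vectorized accumulation with ufunc.at (unbuffered, so repeated
# indices in b_vals each contribute)
is_ok_at = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
np.logical_or.at(is_ok_at, b_vals, d_vals)

# Equivalent explicit Python loop
is_ok_loop = np.zeros(np.max(b_vals) + 1, dtype=np.bool_)
for idx, val in zip(b_vals, d_vals):
    is_ok_loop[idx] = np.logical_or(is_ok_loop[idx], val)

print(is_ok_at)
print(np.array_equal(is_ok_at, is_ok_loop))
```

The unbuffered behaviour is the reason for using `.at` here: every row with a repeated b value is folded into the same slot of `is_ok`.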
Untested since I'm in a hurry, but this should work:
import numpy_indexed as npi
g = npi.group_by(data[:, 1])
ids, valid = g.any(data[:, 3])
result = data[valid[g.inverse]]