Fill 1D numpy array from arrays with indices

Background

I have one 1D NumPy array initialized with zeroes.

import numpy as np
import pandas as pd
section = np.zeros(1000)

Then I have a Pandas DataFrame with indices in two columns:

d= {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
    'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}

df = pd.DataFrame(data=d, columns=['start', 'end'])

For each pair of indices, I want to set the values at the corresponding positions in the numpy array to True.

My current solution

I can do this by applying a function to the DataFrame:

def fill_array(row):
    section[row.start:row.end] = True

df.apply(fill_array, axis=1)

I want to vectorize this operation

This works as I expect, but for the fun of it, I would like to vectorize the operation. I'm not very proficient with this, and my searching online has not put me on the right track.

I would really appreciate any suggestions on how to make this into a vector operation, if at all possible.

The trick for the implementation that follows is to put 1s at every start point and -1s at every end point of a zero-initialized int array. The actual trick comes next: we cumulatively sum it, which gives non-zero numbers at the positions covered by the bin (start-stop pair) boundaries. So the final step is to look for non-zeros to get the final output as a boolean array. Thus, we have two vectorized solutions, with their implementations shown below -

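# Both versions assume every index in start and end is smaller than length.
def filled_array(start, end, length):
    out = np.zeros(length, dtype=int)
    np.add.at(out, start, 1)   # +1 where an interval opens (repeats accumulate)
    np.add.at(out, end, -1)    # -1 where an interval closes
    return out.cumsum() > 0    # positive running count: inside some interval

def filled_array_v2(start, end, length): # Using @Daniel's suggestion
    # bincount counts how many intervals open/close at each position
    out = np.bincount(start, minlength=length) - np.bincount(end, minlength=length)
    return out.cumsum().astype(bool)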

Sample run -

In [2]: start
Out[2]: array([ 4,  7,  5, 15])

In [3]: end
Out[3]: array([12, 12,  7, 17])

In [4]: out = filled_array(start, end, length=20)

In [7]: pd.DataFrame(out) # print as dataframe for easy verification
Out[7]: 
        0
0   False
1   False
2   False
3   False
4    True
5    True
6    True
7    True
8    True
9    True
10   True
11   True
12  False
13  False
14  False
15   True
16   True
17  False
18  False
19  False
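To make the cumulative-sum trick easier to follow, here is a small sketch that prints the intermediate arrays for the sample run and then applies the same function to the DataFrame from the question. The array length of 12001 is an assumption chosen so that the largest end index (12000) is still a valid position; the question itself does not state a suitable length.

```python
import numpy as np
import pandas as pd

def filled_array(start, end, length):
    out = np.zeros(length, dtype=int)
    np.add.at(out, start, 1)
    np.add.at(out, end, -1)
    return out.cumsum() > 0

# Intermediate steps for the sample run above
start = np.array([4, 7, 5, 15])
end = np.array([12, 12, 7, 17])
delta = np.zeros(20, dtype=int)
np.add.at(delta, start, 1)    # +1 at 4, 5, 7, 15
np.add.at(delta, end, -1)     # -1 at 7, 17 and -2 at 12
running = delta.cumsum()      # number of intervals open at each position
print(delta)    # [ 0  0  0  0  1  1  0  0  0  0  0  0 -2  0  0  1  0 -1  0  0]
print(running)  # [0 0 0 0 1 2 2 2 2 2 2 2 0 0 0 1 1 0 0 0]

# Same function applied to the DataFrame from the question
d = {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
     'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}
df = pd.DataFrame(data=d, columns=['start', 'end'])
length = 12001  # assumption: large enough to hold the index 12000
mask = filled_array(df['start'].to_numpy(), df['end'].to_numpy(), length)

# Cross-check against plain slice assignment
expected = np.zeros(length, dtype=bool)
for s, e in zip(df['start'], df['end']):
    expected[s:e] = True
print(np.array_equal(mask, expected))  # True
```

Note that position 7 receives both a +1 and a -1, so its delta is 0, yet the running count stays positive there because the interval starting at 4 is still open.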

Vectorization

You have already done the most important vectorization by using slice assignment, but you cannot fully vectorize this using slices, since Python does not support "multiple slices".

If you really badly want to use vectorization, you can create an array with the "True" indices, like this

indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
section[indices] = True

But this will most likely be slower, since it creates a new temporary array with indices.
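If you want to check that claim on your own machine, a rough timing sketch could look like the following. The array length of 12001 and the helper names fill_with_slices and fill_with_indices are assumptions made for this illustration; the timings depend on your data and hardware, so none are quoted here.

```python
import timeit
import numpy as np
import pandas as pd

d = {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
     'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}
df = pd.DataFrame(data=d, columns=['start', 'end'])
length = 12001  # assumption: large enough for the largest end index

def fill_with_slices():
    # One slice assignment per row
    section = np.zeros(length, dtype=bool)
    for row in df.itertuples():
        section[row.start:row.end] = True
    return section

def fill_with_indices():
    # Build one explicit index array, then assign once
    section = np.zeros(length, dtype=bool)
    indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
    section[indices] = True
    return section

# Both approaches should mark exactly the same positions
same = np.array_equal(fill_with_slices(), fill_with_indices())
print(same)

print('slices :', timeit.timeit(fill_with_slices, number=1000))
print('indices:', timeit.timeit(fill_with_indices, number=1000))
```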

Removing duplicate work

With that said, you could gain some speed-ups by reducing duplicate work. Specifically, you can take the union of the ranges, giving you a set of disjoint intervals.

In your case, the first interval overlaps all except the last one, so your dataframe is equivalent to

d= {'start': {0: 7200, 1: 11400},
    'end': {0: 10800, 1: 12000}}

This reduces the number of slice assignments by 60% (five intervals become two)! But first we need to find these intervals. Following the answer quoted above, we can do this by:

slices = [(row.start, row.end) for row in df.itertuples()]
slices_union = []
for start, end in sorted(slices):
    # Slices are half-open, so merge only when the previous one reaches start
    if slices_union and slices_union[-1][1] >= start:
        slices_union[-1][1] = max(slices_union[-1][1], end)
    else:
        slices_union.append([start, end])

Then you can use this (hopefully much smaller) set of slices like this

for start, end in slices_union:
    section[start:end] = True
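Putting the pieces together, a minimal self-contained sketch of the union approach might look like this. The function name merge_intervals and the array length of 12001 are assumptions for illustration; intervals are treated as half-open, so two of them are merged when they overlap or touch.

```python
import numpy as np

def merge_intervals(slices):
    """Merge half-open (start, end) intervals that overlap or touch."""
    merged = []
    for start, end in sorted(slices):
        if merged and merged[-1][1] >= start:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

# The (start, end) pairs from the question's DataFrame
slices = [(7200, 10800), (7500, 8100), (7560, 8100), (8100, 8150), (11400, 12000)]
slices_union = merge_intervals(slices)
print(slices_union)  # [[7200, 10800], [11400, 12000]]

length = 12001  # assumption: large enough for the largest end index
section = np.zeros(length, dtype=bool)
for start, end in slices_union:
    section[start:end] = True

# Cross-check against filling with the original, unmerged slices
reference = np.zeros(length, dtype=bool)
for start, end in slices:
    reference[start:end] = True
print(np.array_equal(section, reference))  # True
```

Sorting dominates the cost of the merge step, so this only pays off when the intervals overlap heavily or the same set of intervals is reused many times.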
