Fill 1D numpy array from arrays with indices

Question

Background

I have a 1D NumPy array initialized with zeros.

import numpy as np
import pandas as pd
section = np.zeros(12000)  # long enough to hold the largest end index (12000)

Then I have a Pandas DataFrame with indices in two columns:

d= {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
    'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}

df = pd.DataFrame(data=d, columns=['start', 'end'])

For each pair of indices, I want to set the corresponding elements of the NumPy array to True.

My current solution

I can do this by applying a function to the DataFrame:

def fill_array(row):
    section[row.start:row.end] = True

df.apply(fill_array, axis=1)

I want to vectorize this operation

This works as I expect, but for the fun of it, I would like to vectorize the operation. I'm not very proficient with this, and my searching online has not put me on the right track.

I would really appreciate any suggestions on how to turn this into a vector operation, if that is possible at all.

Answer 1

The trick in the implementation below is to put a 1 at every start point and a -1 at every end point of a zero-initialized int array. Taking the cumulative sum then gives non-zero values at exactly the positions covered by the bins (start-end pairs). The final step is to test for non-zero values to get the output as a boolean array. This gives two vectorized solutions, shown below:

def filled_array(start, end, length):
    out = np.zeros(length, dtype=int)
    np.add.at(out, start, 1)   # +1 at each start; add.at accumulates repeated indices
    np.add.at(out, end, -1)    # -1 at each end
    return out.cumsum() > 0    # positive running sum: inside at least one bin

def filled_array_v2(start, end, length):  # using @Daniel's suggestion
    out = np.bincount(start, minlength=length) - np.bincount(end, minlength=length)
    return out.cumsum().astype(bool)

Sample run:

In [2]: start
Out[2]: array([ 4,  7,  5, 15])

In [3]: end
Out[3]: array([12, 12,  7, 17])

In [4]: out = filled_array(start, end, length=20)

In [7]: pd.DataFrame(out) # print as dataframe for easy verification
Out[7]: 
        0
0   False
1   False
2   False
3   False
4    True
5    True
6    True
7    True
8    True
9    True
10   True
11   True
12  False
13  False
14  False
15   True
16   True
17  False
18  False
19  False
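
To connect this back to the question's DataFrame, the sketch below applies the same start/end marker idea to the `start` and `end` columns. The function name `fill_from_frame` and the extra slot at the end of the marker array (so that an `end` equal to the array length stays in bounds) are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_from_frame(df, length):
    """Hypothetical helper: boolean mask of positions covered by [start, end)."""
    start = df['start'].to_numpy()
    end = df['end'].to_numpy()
    # One extra slot so that end == length does not index out of bounds.
    markers = np.zeros(length + 1, dtype=int)
    np.add.at(markers, start, 1)    # +1 where an interval opens
    np.add.at(markers, end, -1)     # -1 where an interval closes
    # A positive running sum means at least one interval is open there.
    return markers.cumsum()[:length] > 0

d = {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
     'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}
df = pd.DataFrame(data=d, columns=['start', 'end'])

mask = fill_from_frame(df, length=12000)
print(mask.sum())  # number of covered positions
```

With the sample data, only the ranges 7200 to 10800 and 11400 to 12000 contribute distinct positions, since the other three rows fall inside or touch the first one.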

Answer 2

Vectorization

You have already done the most important vectorization by using slice assignment, but you cannot fully vectorize this with slices, since Python does not support "multiple slices".

If you really want to vectorize it, you can build an array of the indices to set to True, like this:

indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
section[indices] = True

But this will most likely be slower, since it creates a new temporary array of indices.
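
A quick way to check that the index-array version matches the slice loop is sketched below. The array length of 12000 and the names `by_loop` and `by_index` are assumptions, chosen so that every index in the sample data is in bounds.

```python
import numpy as np
import pandas as pd

d = {'start': {0: 7200, 1: 7500, 2: 7560, 3: 8100, 4: 11400},
     'end': {0: 10800, 1: 8100, 2: 8100, 3: 8150, 4: 12000}}
df = pd.DataFrame(data=d, columns=['start', 'end'])

# Reference result: one slice assignment per row.
by_loop = np.zeros(12000, dtype=bool)
for row in df.itertuples():
    by_loop[row.start:row.end] = True

# Index-array version: np.r_ expands each slice into explicit indices
# and concatenates them into one temporary integer array.
indices = np.r_[tuple(slice(row.start, row.end) for row in df.itertuples())]
by_index = np.zeros(12000, dtype=bool)
by_index[indices] = True

print(np.array_equal(by_loop, by_index))
print(indices.size, by_loop.sum())
```

For this data the temporary index array holds 5390 entries, while only 4200 distinct positions end up set, which shows how much of the work is repeated on overlapping ranges.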

Removing duplicate work

That said, you could gain some speed by reducing duplicate work. Specifically, you can take the union of the ranges, giving you a set of disjoint intervals.

In your case, the first interval overlaps all except the last one, so your DataFrame is equivalent to

d= {'start': {0: 7200, 1: 11400},
    'end': {0: 10800, 1: 12000}}

This reduces the number of slice assignments from five to two, a 60% reduction! But first we need to find these intervals. Following the approach for taking the union of ranges referenced above, we can do this with:

slices = [(row.start, row.end) for row in df.itertuples()]
slices_union = []
for start, end in sorted(slices):
    # Half-open ranges: merge when the next start is at or before the current end.
    if slices_union and slices_union[-1][1] >= start:
        slices_union[-1][1] = max(slices_union[-1][1], end)
    else:
        slices_union.append([start, end])

Then you can use these (hopefully far fewer) slices like this:

for start, end in slices_union:
    section[start:end] = True
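
The merging step can also be wrapped in a small function and checked against the direct approach. The sketch below is one possible packaging; the name `merge_intervals` is hypothetical, and it treats the ranges as half-open `[start, end)` slices, so two ranges are merged when the next start is at or before the current end.

```python
import numpy as np

def merge_intervals(slices):
    """Hypothetical helper: merge overlapping or touching half-open ranges."""
    merged = []
    for start, end in sorted(slices):
        if merged and merged[-1][1] >= start:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

slices = [(7200, 10800), (7500, 8100), (7560, 8100), (8100, 8150), (11400, 12000)]
slices_union = merge_intervals(slices)
print(slices_union)  # [[7200, 10800], [11400, 12000]]

# Filling from the merged ranges gives the same mask as filling from all of them.
section = np.zeros(12000, dtype=bool)
for start, end in slices_union:
    section[start:end] = True

expected = np.zeros(12000, dtype=bool)
for start, end in slices:
    expected[start:end] = True
print(np.array_equal(section, expected))
```

Sorting first means each range only needs to be compared with the last merged range, so the merge itself is a single pass over the sorted list.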
