Vectorized Python code for iterating over and changing each column of a Pandas DataFrame within a window

Question

I have a dataframe of ones and zeros. I iterate over each column with a loop. When I hit a one, I keep it in the column. But if there are any ones in the next n positions after it, I should turn them into zeros. I then continue the same way to the end of the column, and repeat all of this for each column.
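To make the rule concrete, here is a minimal plain-Python sketch of it for a single column. The function name `clear_after_ones` is made up for illustration and is not from the original post; it only restates the loop described above.

```python
def clear_after_ones(column, n):
    """Keep each surviving 1 and zero out any 1s in the next n positions."""
    result = list(column)
    i = 0
    while i < len(result):
        if result[i] == 1:
            # zero the next n positions, stopping at the end of the column
            for j in range(i + 1, min(i + 1 + n, len(result))):
                result[j] = 0
            i += n + 1  # skip past the cleared window
        else:
            i += 1
    return result


print(clear_after_ones([1, 1, 0, 1, 1, 0, 0, 1], n=2))
# [1, 0, 0, 1, 0, 0, 0, 1]
```

Note that the one at position 3 survives even though it is preceded by only two zeros in the result: what matters is the distance from the last kept one, not the number of zeros in the original data.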

Is it possible to get rid of the loop and vectorize everything with dataframe/matrix/array operations in pandas/numpy? If so, how should I go about it? n could be anywhere from 2 to 100.

I tried this function, but it fails: it only keeps ones if there are at least n zeros between them, which is obviously not what I need:

import numpy as np
import pandas as pd

def clear_window(df, n):

    # create buffer of size n
    pad = pd.DataFrame(np.zeros([n, df.shape[1]]),
                       columns=df.columns)
    padded_df = pd.concat([pad, df])

    # compute rolling sum and cut off the buffer
    roll = (padded_df
            .rolling(n+1)
            .sum()
            .iloc[n:, :]
           )

    # delete ones where rolling sum is above 1 or below -1
    result = df * ((roll == 1.0) | (roll == -1.0)).astype(int)

    return result

Answer 1

Numba can speed up sequential looping problems like this one if you can't find a way to vectorize them.

This code steps through the rows of a column looking for a target value. When a target value (1) is found, the next n rows are set to the fill value (0). The search index then jumps past the filled rows and the next search begins.

from numba import jit

@jit(nopython=True)
def find_and_fill(arr, span, tgt_val=1, fill_val=0):
    start_idx = 0
    end_idx = arr.size
    while start_idx < end_idx:
        if arr[start_idx] == tgt_val:
            arr[start_idx + 1 : start_idx + 1 + span] = fill_val
            start_idx = start_idx + 1 + span
        else:
            start_idx = start_idx + 1
    return arr

df2 = df.copy()
# copy the dataframe values into a writable numpy array
a = df2.to_numpy(copy=True)

# iterate over the columns (rows of the transpose); each col is a view
# into a, so find_and_fill modifies a in place
for col in a.T:
    # fill span is set to 6 in this example
    find_and_fill(col, 6)

# assign the array back to the dataframe
df2[list(df2.columns)] = a

# df2 now contains the result values
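If Numba is not available, one alternative is to loop over the rows once and handle all columns at the same time with NumPy. This is not the approach from the answer above, just a sketch under the assumption that the data contains only ones and zeros; the name `clear_window_by_rows` and the `cooldown` counter are illustrative.

```python
import numpy as np


def clear_window_by_rows(values, n):
    """Apply the keep-one-then-clear-n rule to every column of a 2-D 0/1 array."""
    values = np.asarray(values)
    out = np.zeros_like(values)
    # cooldown[c] = number of upcoming rows in column c that must still be zeroed
    cooldown = np.zeros(values.shape[1], dtype=np.int64)
    for r in range(values.shape[0]):
        # a 1 is kept only if its column is not inside a cleared window
        keep = (values[r] == 1) & (cooldown == 0)
        out[r, keep] = 1
        # a kept 1 starts a fresh window of n rows; otherwise count down
        cooldown = np.where(keep, n, np.maximum(cooldown - 1, 0))
    return out


a = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 0],
              [1, 1]])
print(clear_window_by_rows(a, 2).tolist())
# [[1, 0], [0, 1], [0, 0], [1, 0], [0, 1]]
```

The Python-level loop still runs once per row, but the work inside it is vectorized across columns, so it tends to pay off when the dataframe is wide. For a dataframe `df`, the result can be wrapped back with `pd.DataFrame(clear_window_by_rows(df.to_numpy(), n), index=df.index, columns=df.columns)`.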

Solution 1
0 2018-09-26 22:09:17
