
Vectorized Python code for iterating and changing each column of a Pandas DataFrame within a window

I have a DataFrame of ones and zeros. I iterate over each column with a loop. When I encounter a one, I keep it, but any ones in the next n positions after it must be turned into zeros. The scan then continues in the same way to the end of the column, and the whole process is repeated for every column.
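To pin down the rule described above, here is a minimal pure-Python sketch for a single column. The function name `clear_after_ones` and the sample data are made up for illustration, and the sketch assumes the column contains only 0 and 1.

```python
def clear_after_ones(column, n):
    """Keep each surviving 1 and zero out the n positions after it."""
    result = list(column)  # work on a copy so the input is left untouched
    i = 0
    while i < len(result):
        if result[i] == 1:
            # zero the next n positions, stopping at the end of the list
            for j in range(i + 1, min(i + 1 + n, len(result))):
                result[j] = 0
            i += n + 1  # resume the search after the cleared window
        else:
            i += 1
    return result


print(clear_after_ones([1, 1, 0, 1, 1, 0, 0, 1], 2))
# prints [1, 0, 0, 1, 0, 0, 0, 1]
```

Whether a one survives depends on which earlier ones survived, and that dependency is what makes the problem hard to vectorize.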

Is it possible to get rid of the loop and vectorize everything with DataFrame/matrix/array operations in pandas/NumPy? If so, how should I go about it? n can be anywhere from 2 to 100.

I tried the function below, but it does not work: it keeps a one only when at least n zeros separate it from the previous one, which is not what I need:

import numpy as np
import pandas as pd

def clear_window(df, n):
    # create a buffer of n rows of zeros
    pad = pd.DataFrame(np.zeros([n, df.shape[1]]),
                       columns=df.columns)
    padded_df = pd.concat([pad, df])

    # compute rolling sum and cut off the buffer
    roll = (padded_df
            .rolling(n+1)
            .sum()
            .iloc[n:, :]
           )

    # keep values only where the rolling sum is exactly 1 or -1
    result = df * ((roll == 1.0) | (roll == -1.0)).astype(int)

    return result
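A small example may help show why the rolling-sum attempt differs from the intended rule. The sketch below restates, in plain Python, what the function above does to one 0/1 column. The name `rolling_rule` is made up for illustration, and it assumes the column holds only 0 and 1.

```python
def rolling_rule(column, n):
    """Keep a 1 only if the n original positions before it contain no other 1."""
    kept = []
    for i, value in enumerate(column):
        window = column[max(0, i - n):i]  # up to n preceding original values
        kept.append(1 if value == 1 and sum(window) == 0 else 0)
    return kept


print(rolling_rule([1, 1, 1], 1))  # prints [1, 0, 0]
```

Applying the intended rule by hand with n = 1 gives [1, 0, 1]: the first one is kept, the second is cleared, and the third is kept because its predecessor is now zero. The rolling sum only ever sees the original values, so it cannot account for ones that were already cleared.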

Numba can give you speed on sequential looping problems like this one when you can't find a way to vectorize them.

This code loops through every row of a column looking for a target value. When a target value (1) is found, the next n rows are set to the fill value (0). The search index is then moved past the filled rows and the next search begins from there.

from numba import jit

@jit(nopython=True)
def find_and_fill(arr, span, tgt_val=1, fill_val=0):
    start_idx = 0
    end_idx = arr.size
    while start_idx < end_idx:
        if arr[start_idx] == tgt_val:
            arr[start_idx + 1 : start_idx + 1 + span] = fill_val
            start_idx = start_idx + 1 + span
        else:
            start_idx = start_idx + 1
    return arr

df2 = df.copy()
# get the dataframe values into a writable numpy array
# (copy=True guards against pandas returning a read-only view)
a = df2.to_numpy(copy=True)

# transpose and run the function for each column of the dataframe
for col in a.T:
    # fill span is set to 6 in this example
    # col is a view into a, so the function modifies a in place
    find_and_fill(col, 6)

# assign the array back to the dataframe
df2[list(df2.columns)] = a

# df2 now contains the result values
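If Numba is not an option, one possible alternative (not part of the answer above, and only a sketch) is to jump between the positions of the ones with NumPy instead of visiting every row. The name `clear_window_jump` is hypothetical, and the code assumes a 1-D column holding only 0 and 1.

```python
import numpy as np


def clear_window_jump(col, n):
    """Return a copy of a 0/1 column with the n entries after each kept 1 zeroed."""
    ones = np.flatnonzero(col == 1)  # sorted positions of all ones
    out = np.zeros_like(col)
    k = 0
    while k < ones.size:
        pos = ones[k]
        out[pos] = 1  # this one survives
        # jump to the first one lying beyond the cleared window
        k = np.searchsorted(ones, pos + n, side="right")
    return out


col = np.array([1, 1, 0, 1, 1, 0, 0, 1])
print(clear_window_jump(col, 2))  # prints [1 0 0 1 0 0 0 1]
```

This is still a loop, but it runs once per kept one rather than once per row, which can help when n is large. It is not fully vectorized.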
