
Pandas: Setting True to False in a column, if it appears less than n times in a row

I have a boolean column in a data frame. If True appears fewer than n times in a row (in my case n is 4), I want to set those True values to False. The following code does that:

example_data = [False,False,False,False,True,True,False,False,True,False,False,
                False,True,True,True,False,False,False,True,True,True,True,
                True,False]

import pandas as pd

df = pd.DataFrame(example_data,columns=["input"])

# At the beginning the output is equal to the input.
df["output"] = df["input"]

# This counter tracks how many True values have appeared in a row.
true_count = 0

# The smallest number of consecutive True values required to keep them.
n = 4

for index, row in df.iterrows():

    # If the current value is True, the true_count is increased.
    if row["input"]:
        true_count += 1

    # If the value is False and the previous value was False as well,
    # nothing happens.
    elif true_count == 0:
        pass

    # If the true_count is smaller than n, the previous true_count rows
    # are set to False in the output. After that the true_count is reset to 0.
    # Note: index - (i + 1) assumes the default integer index.
    elif true_count < n:
        for i in range(true_count):
            df.at[index - (i + 1), "output"] = False
        true_count = 0

    # If the true_count is n or greater, it is simply reset to 0.
    else:
        true_count = 0

# A run that reaches the end of the column is never followed by a False,
# so it has to be handled after the loop.
if 0 < true_count < n:
    df.loc[len(df) - true_count:, "output"] = False

The data frame will look something like this:

    input  output
0   False   False
1   False   False
2   False   False
3   False   False
4    True   False
5    True   False
6   False   False
7   False   False
8    True   False
9   False   False
10  False   False
11  False   False
12   True   False
13   True   False
14   True   False
15  False   False
16  False   False
17  False   False
18   True    True
19   True    True
20   True    True
21   True    True
22   True    True
23  False   False

My question is whether there is a more "pandas" way to do this, as iterating over the rows is quite slow. I thought about some functionality that replaces given sequences, for example False, True, True, True, False, but I didn't find anything like that.

Thanks in advance for any helpful answer.

The idea is to create a group label for each run of consecutive True values with Series.cumsum on the inverted boolean mask, replace the labels of the False rows with NaN using Series.where, then map each label to the size of its group with Series.map and Series.value_counts, and finally compare that size with the threshold using Series.ge (greater than or equal, so that a run of exactly 4 is kept):

s = (~df['input']).cumsum().where(df['input'])

df['out'] = s.map(s.value_counts()).ge(4)
print (df)
    input  output    out
0   False   False  False
1   False   False  False
2   False   False  False
3   False   False  False
4    True   False  False
5    True   False  False
6   False   False  False
7   False   False  False
8    True   False  False
9   False   False  False
10  False   False  False
11  False   False  False
12   True   False  False
13   True   False  False
14   True   False  False
15  False   False  False
16  False   False  False
17  False   False  False
18   True    True   True
19   True    True   True
20   True    True   True
21   True    True   True
22   True    True   True
23  False   False  False

Details:

s = (~df['input']).cumsum().where(df['input'])
print (df.assign(inv = (~df['input']),
                 cumsum = (~df['input']).cumsum(),
                 s = (~df['input']).cumsum().where(df['input']),
                 count = s.map(s.value_counts()),
                 out = s.map(s.value_counts()).ge(4)))
       
    input  output    inv  cumsum     s  count    out
0   False   False   True       1   NaN    NaN  False
1   False   False   True       2   NaN    NaN  False
2   False   False   True       3   NaN    NaN  False
3   False   False   True       4   NaN    NaN  False
4    True   False  False       4   4.0    2.0  False
5    True   False  False       4   4.0    2.0  False
6   False   False   True       5   NaN    NaN  False
7   False   False   True       6   NaN    NaN  False
8    True   False  False       6   6.0    1.0  False
9   False   False   True       7   NaN    NaN  False
10  False   False   True       8   NaN    NaN  False
11  False   False   True       9   NaN    NaN  False
12   True   False  False       9   9.0    3.0  False
13   True   False  False       9   9.0    3.0  False
14   True   False  False       9   9.0    3.0  False
15  False   False   True      10   NaN    NaN  False
16  False   False   True      11   NaN    NaN  False
17  False   False   True      12   NaN    NaN  False
18   True    True  False      12  12.0    5.0   True
19   True    True  False      12  12.0    5.0   True
20   True    True  False      12  12.0    5.0   True
21   True    True  False      12  12.0    5.0   True
22   True    True  False      12  12.0    5.0   True
23  False   False   True      13   NaN    NaN  False
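
The steps in the table above can be wrapped in a small helper so that the threshold is a parameter. This is only a sketch: the function name keep_true_runs and its arguments are made up for illustration, and it assumes the input is a boolean Series without missing values.

```python
import pandas as pd


def keep_true_runs(mask: pd.Series, n: int) -> pd.Series:
    """Return a boolean Series where True runs shorter than n become False.

    Hypothetical helper, not part of pandas.
    """
    mask = mask.astype(bool)
    # Every False bumps the counter, so all True values in one run share a label;
    # the False rows themselves are blanked out to NaN.
    run_id = (~mask).cumsum().where(mask)
    # value_counts ignores NaN, so only the True rows are counted per label.
    run_len = run_id.map(run_id.value_counts())
    # NaN compares as False, and ge keeps runs of exactly n.
    return run_len.ge(n)


example = pd.Series([True, True, False, True, True, True, True, False, True])
result = keep_true_runs(example, 4)
print(result.tolist())
```

With n = 4, the run of two at the start and the single True at the end are dropped, while the run of exactly four in the middle is kept.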

Here's a way to do that:

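N = 4

# Each False starts a new group, so a group is one False plus the True run after it.
group = (~df["input"]).cumsum()
df["group_size"] = df.groupby(group)["input"].transform("size")
# group_size is run length + 1, so "> N" keeps runs of at least N.
df["output"] = (df["group_size"] > N) & df["input"]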

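The output is shown below. Note that each group includes the False that starts it, so group_size is the length of the True run plus 1; that is why comparing with > N gives the right result here. (A run of True at the very start of the column has no leading False, so its group_size is the run length itself.)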

    input  group_size  output
0   False           1   False
1   False           1   False
2   False           1   False
3   False           3   False
4    True           3   False
5    True           3   False
6   False           1   False
7   False           2   False
8    True           2   False
9   False           1   False
10  False           1   False
11  False           4   False
12   True           4   False
13   True           4   False
14   True           4   False
15  False           1   False
16  False           1   False
17  False           6   False
18   True           6    True
19   True           6    True
20   True           6    True
21   True           6    True
22   True           6    True
23  False           1   False
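
A possible variation, sketched here as an alternative rather than as part of the answer above, is to label runs by where the value changes instead of by counting False values. Each run of True then forms its own group, so the size needs no "+ 1" correction and a run at the very start of the column is treated like any other. The sample data and the names run_id and run_len are chosen only for illustration.

```python
import pandas as pd

N = 4
# Illustrative data: starts with a True run of exactly N.
df = pd.DataFrame({"input": [True, True, True, True, False, True, True, False]})

# A new run starts wherever the value differs from the previous row.
run_id = df["input"].ne(df["input"].shift()).cumsum()
# Length of the run each row belongs to (True runs and False runs alike).
run_len = df.groupby(run_id)["input"].transform("size")
# Keep a True only if its run has at least N rows.
df["output"] = df["input"] & run_len.ge(N)
print(df)
```

Here the leading run of four True values is kept and the later run of two is set to False.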
