I have a boolean column in a data frame, and I want to set every run of fewer than n consecutive True values to False. In my case n is 4, so if True appears fewer than 4 times in a row, those True values should become False. The following code does that:
import pandas as pd

example_data = [False, False, False, False, True, True, False, False, True, False, False,
                False, True, True, True, False, False, False, True, True, True, True,
                True, False]
df = pd.DataFrame(example_data, columns=["input"])
# At the beginning the output is equal to the input.
df["output"] = df["input"]
# This counter tracks how many True values have appeared in a row.
true_count = 0
# The smallest number of consecutive True values that is kept.
n = 4
for index, row in df.iterrows():
    # If the current value is True, the counter is increased.
    if row["input"]:
        true_count += 1
    # If the value is False and the previous value was False as well,
    # nothing happens.
    elif true_count == 0:
        pass
    # If the run that just ended is shorter than n, the previous
    # true_count rows are set to False and the counter is reset.
    elif true_count < n:
        for i in range(true_count):
            df.at[index - (i + 1), "output"] = False
        true_count = 0
    # If the run has n or more True values, the counter is simply reset.
    else:
        true_count = 0
# Note: a short run at the very end of the frame is not reset, because the
# reset only happens when a False follows the run.
The data frame will look something like this:
input output
0 False False
1 False False
2 False False
3 False False
4 True False
5 True False
6 False False
7 False False
8 True False
9 False False
10 False False
11 False False
12 True False
13 True False
14 True False
15 False False
16 False False
17 False False
18 True True
19 True True
20 True True
21 True True
22 True True
23 False False
My question is whether there is a more "pandas" way to do this, since iterating over the data is quite slow. I thought about some functionality that replaces given sequences, for example False, True, True, True, False, but I didn't find anything like that.
Thanks in advance for any helpful answer.
The idea is to create one group for each run of consecutive True values with Series.cumsum on the inverted boolean mask, then replace the rows that are not True with NaN using Series.where, and finally count the size of each group with Series.map and Series.value_counts and compare it against the threshold with Series.ge:
s = (~df['input']).cumsum().where(df['input'])
# ge(4) keeps runs of 4 or more, so only runs of fewer than 4 become False.
df['out'] = s.map(s.value_counts()).ge(4)
print (df)
input output out
0 False False False
1 False False False
2 False False False
3 False False False
4 True False False
5 True False False
6 False False False
7 False False False
8 True False False
9 False False False
10 False False False
11 False False False
12 True False False
13 True False False
14 True False False
15 False False False
16 False False False
17 False False False
18 True True True
19 True True True
20 True True True
21 True True True
22 True True True
23 False False False
Details:
s = (~df['input']).cumsum().where(df['input'])
print (df.assign(inv = (~df['input']),
                 cumsum = (~df['input']).cumsum(),
                 s = (~df['input']).cumsum().where(df['input']),
                 count = s.map(s.value_counts()),
                 out = s.map(s.value_counts()).ge(4)))
input output inv cumsum s count out
0 False False True 1 NaN NaN False
1 False False True 2 NaN NaN False
2 False False True 3 NaN NaN False
3 False False True 4 NaN NaN False
4 True False False 4 4.0 2.0 False
5 True False False 4 4.0 2.0 False
6 False False True 5 NaN NaN False
7 False False True 6 NaN NaN False
8 True False False 6 6.0 1.0 False
9 False False True 7 NaN NaN False
10 False False True 8 NaN NaN False
11 False False True 9 NaN NaN False
12 True False False 9 9.0 3.0 False
13 True False False 9 9.0 3.0 False
14 True False False 9 9.0 3.0 False
15 False False True 10 NaN NaN False
16 False False True 11 NaN NaN False
17 False False True 12 NaN NaN False
18 True True False 12 12.0 5.0 True
19 True True False 12 12.0 5.0 True
20 True True False 12 12.0 5.0 True
21 True True False 12 12.0 5.0 True
22 True True False 12 12.0 5.0 True
23 False False True 13 NaN NaN False
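The same chain can be wrapped in a small helper so that the threshold becomes a parameter. This is only a sketch: the function name keep_long_true_runs and the sample Series are made up for illustration, and it assumes that a run of exactly n True values should be kept, which is how the question reads.

```python
import pandas as pd


def keep_long_true_runs(col: pd.Series, n: int) -> pd.Series:
    """Keep only runs of at least n consecutive True values (hypothetical helper)."""
    # Every False increments the counter, so all True values of one run share a label.
    labels = (~col).cumsum().where(col)
    # value_counts() gives the length of each run; map() writes it back onto the rows.
    run_length = labels.map(labels.value_counts())
    # Rows that are False in col hold NaN here, and NaN >= n evaluates to False.
    return run_length.ge(n)


# Illustrative data: one run of exactly 4 and one run of 3.
example = pd.Series([False, True, True, True, True, False, True, True, True, False])
result = keep_long_true_runs(example, 4)
print(result.tolist())
```

With the data frame from the question it would be called as df['out'] = keep_long_true_runs(df['input'], 4).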
Here's a way to do that:
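N = 4
# Each False starts a new group: one False followed by the True values after it.
group = (~df["input"]).cumsum()
df["group_size"] = df.groupby(group)["input"].transform("count")
# group_size includes the leading False, so "> N" means at least N consecutive True values.
df["output"] = (df["group_size"] > N) & df["input"]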
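The output is shown below. Note that group_size is always the length of the True run plus 1, because each group also contains the False that precedes the run. That is why the comparison uses > N, and the final result is fine: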
input group_size output
0 False 1 False
1 False 1 False
2 False 1 False
3 False 3 False
4 True 3 False
5 True 3 False
6 False 1 False
7 False 2 False
8 True 2 False
9 False 1 False
10 False 1 False
11 False 4 False
12 True 4 False
13 True 4 False
14 True 4 False
15 False 1 False
16 False 1 False
17 False 6 False
18 True 6 True
19 True 6 True
20 True 6 True
21 True 6 True
22 True 6 True
23 False 1 False
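One edge case of the grouping above: if the column starts with True, the first group has no leading False, so its size is the plain run length and a leading run of exactly N would be dropped by > N. A possible variant, sketched here under the assumption that runs should be measured on their own, labels each run by comparing every row with the previous one. The name filter_short_runs and the sample data are hypothetical.

```python
import pandas as pd


def filter_short_runs(col: pd.Series, n: int) -> pd.Series:
    """Keep True only inside runs of at least n consecutive True values (hypothetical helper)."""
    # A new run starts wherever the value differs from the previous row.
    run_id = col.ne(col.shift()).cumsum()
    # Length of the run each row belongs to (runs of False are measured as well).
    run_length = col.groupby(run_id).transform("size")
    # A row survives only if it is True and its run is long enough.
    return col & run_length.ge(n)


# Illustrative data that starts with a run of exactly 4 True values.
leading = pd.Series([True, True, True, True, False, True, False, True, True])
filtered = filter_short_runs(leading, 4)
print(filtered.tolist())
```

Because the run length no longer includes a neighbouring False, the threshold can be compared with ge(n) directly.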