Pandas add column on condition: If value of cell is True set value of largest number in Period to true
I have a pandas DataFrame with, let's say, two columns, for example:
    value  boolean
0       1        0
1       5        1
2       0        0
3       3        0
4       9        1
5      12        0
6       4        0
7       7        1
8       8        1
9       2        0
10     17        0
11     15        1
12      6        0
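For anyone who wants to reproduce this, here is a minimal sketch of building the same frame. The variable name df is an assumption, chosen because the snippets in the answers refer to the frame by that name.

```python
import pandas as pd

# Same data as the table above, with the default integer index 0..12
df = pd.DataFrame({
    "value":   [1, 5, 0, 3, 9, 12, 4, 7, 8, 2, 17, 15, 6],
    "boolean": [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})
print(df)
```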
Now I want to add a third column (new_boolean) with the following criteria: I specify a period, for this example period = 4. Then I look at all rows where boolean == 1. For each such row, I take the last period rows up to and including that row, and new_boolean will be 1 in the row that holds the maximum value of that window.
For example, I have boolean == 1 in the second row (index 1). So I look at the last period rows. Only two rows exist at that point, so the values are [1, 5]. 5 is the maximum, so the value for new_boolean in that row will be 1.
Second example: the eighth row (index 7, value = 7). I get the values [7, 4, 12, 9]. 12 is the maximum, so the value for new_boolean in the row with value 12 (index 5) will be 1.
Result:
    value  boolean  new_boolean
0       1        0            0
1       5        1            1
2       0        0            0
3       3        0            0
4       9        1            1
5      12        0            1
6       4        0            0
7       7        1            0
8       8        1            0
9       2        0            0
10     17        0            1
11     15        1            0
12      6        0            0
How can I do this algorithmically?
Use df.index with df.iloc and idxmax:
In [182]: period = 4  # length of the look-back window
In [183]: ix = df[df.boolean.eq(1)].index  # index labels of the rows where boolean == 1
In [213]: new_bool_ix = []  # will hold the index label of each window maximum
# For every label in `ix`, take the last `period` rows up to and including that row
# and append the index of the maximum `value`.
# Note: `df.iloc[:i + 1]` uses the label as a position, so this assumes the default 0..n-1 index.
In [215]: for i in ix:
     ...:     new_bool_ix.append(df.iloc[:i + 1].iloc[-period:]['value'].idxmax())
     ...:
In [225]: df['new_boolean'] = 0  # create column new_boolean with default value 0
In [227]: df.loc[new_bool_ix, 'new_boolean'] = 1  # set the value to 1 for the labels in new_bool_ix
In [228]: df
Out[228]:
    value  boolean  new_boolean
0       1        0            0
1       5        1            1
2       0        0            0
3       3        0            0
4       9        1            1
5      12        0            1
6       4        0            0
7       7        1            0
8       8        1            0
9       2        0            0
10     17        0            1
11     15        1            0
12      6        0            0
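The loop above can also be wrapped in a function that works purely on positions, so it does not depend on the frame having the default integer index. This is only a sketch: the function name mark_period_max and its parameters are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def mark_period_max(df, period, value_col="value", flag_col="boolean"):
    """Return a 0/1 array marking the window maximum for every flagged row.

    Hypothetical helper: for each row where `flag_col` == 1, look at the last
    `period` rows (including that row) and mark the position of the largest
    `value_col`. On ties the first occurrence is marked, like `idxmax`.
    """
    values = df[value_col].to_numpy()
    flags = df[flag_col].to_numpy()
    out = np.zeros(len(df), dtype=int)
    for pos in np.flatnonzero(flags == 1):
        start = max(0, pos - period + 1)           # window may be shorter at the top
        window = values[start:pos + 1]
        out[start + int(np.argmax(window))] = 1    # translate offset back to a row position
    return out

df = pd.DataFrame({
    "value":   [1, 5, 0, 3, 9, 12, 4, 7, 8, 2, 17, 15, 6],
    "boolean": [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})
df["new_boolean"] = mark_period_max(df, period=4)
print(df)
```

Because the result is assigned as a plain array, the same call should also work when the frame has a non-default index, for example dates.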
Compute the rolling max of the 'value' column:
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0      1.0
1      5.0
2      5.0
3      5.0
4      9.0
5     12.0
6     12.0
7     12.0
8     12.0
9      8.0
10    17.0
11    17.0
12    17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' == 1:
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
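The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values. Note that this matches by value rather than by position, so it relies on a window maximum not appearing again in a row outside the relevant windows: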
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
    value  boolean  new_boolean
0       1        0            0
1       5        1            1
2       0        0            0
3       3        0            0
4       9        1            1
5      12        0            1
6       4        0            0
7       7        1            0
8       8        1            0
9       2        0            0
10     17        0            1
11     15        1            0
12      6        0            0
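Since isin matches on values, a repeated value can be flagged even when it lies outside every relevant window. The sketch below compares the value-based approach with a position-based variant on a small made-up frame. The function names flag_by_isin and flag_by_position and the frame dup are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

def flag_by_isin(df, period):
    """Value-based approach as shown above."""
    rolling_max = df["value"].rolling(window=period, min_periods=1).max()
    on_values = rolling_max[df["boolean"] == 1].unique()
    return df["value"].isin(on_values).astype(int)

def flag_by_position(df, period):
    """Hypothetical variant: track where each window maximum sits, not its value."""
    # Offset of the maximum inside each (possibly shorter) window
    offset = df["value"].rolling(window=period, min_periods=1).apply(np.argmax, raw=True)
    pos = np.arange(len(df))
    start = np.maximum(pos - period + 1, 0)          # first row position of each window
    max_pos = start + offset.to_numpy().astype(int)  # absolute position of each maximum
    out = np.zeros(len(df), dtype=int)
    out[max_pos[df["boolean"].to_numpy() == 1]] = 1  # keep only windows ending on a flagged row
    return pd.Series(out, index=df.index)

# Made-up data: the value 7 appears at position 0, outside the only flagged window
dup = pd.DataFrame({"value": [7, 1, 1, 1, 1, 7], "boolean": [0, 0, 0, 0, 0, 1]})
by_isin = flag_by_isin(dup, period=4)
by_pos = flag_by_position(dup, period=4)
print(pd.DataFrame({"isin": by_isin, "position": by_pos}))
```

On the data from the question both functions should agree, because no window maximum is repeated there.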
I did this in two steps, but I think the solution is much clearer:
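from io import StringIO
import pandas as pd

# Load the sample data; columns are separated by whitespace, `id` becomes the index
df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''), sep=r'\s+', index_col=0)
# Step 1: rolling maximum of `value` over the last 4 rows
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
# Step 2: 1 where the row is itself the rolling maximum and boolean == 1, else 0
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) and (x['boolean'] == 1)) else 0, axis=1)
df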
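Result (this flags only rows where the boolean row itself holds the rolling maximum, so rows 5 and 10 from the question's expected output are not set):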
    value  boolean  new_bool
id
0       1        0         0
1       5        1         1
2       0        0         0
3       3        0         0
4       9        1         1
5      12        0         0
6       4        0         0
7       7        1         0
8       8        1         0
9       2        0         0
10     17        0         0
11     15        1         0
12      6        0         0
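The row-wise apply in the second step can probably be replaced by a vectorized comparison. The sketch below is meant to reproduce the same table as above. It keeps the same rule, where a row is marked only when it is flagged and is itself the rolling maximum. The frame is rebuilt from lists here only to keep the snippet self-contained.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "value":   [1, 5, 0, 3, 9, 12, 4, 7, 8, 2, 17, 15, 6],
        "boolean": [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
    },
    index=pd.RangeIndex(13, name="id"),
)

# Step 1: rolling maximum over the last 4 rows (shorter windows at the top)
rolling_max = df["value"].rolling(window=4, min_periods=1).max()

# Step 2: elementwise test instead of a row-wise apply
df["new_bool"] = ((df["value"] == rolling_max) & (df["boolean"] == 1)).astype(int)
print(df)
```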