How to create an indicator column to indicate specific change from a previous entry in a dataframe where it's sorted and grouped by ID?
I have a dataframe of client records keyed by CLIENT_ID, as shown below:
CLIENT_ID | CURRENT_DATE_STATUS | STATUS |
---|---|---|
10002 | 2017-07-21 | STARTED |
10002 | 2017-07-21 | STARTED |
10002 | 2018-07-01 | CHURNED |
10002 | 2018-07-01 | CHURNED |
10002 | 2019-01-01 | RESTARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-08-15 | STARTED |
11811 | 2019-12-31 | RESTARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-03-11 | STARTED |
22101 | 2020-11-01 | CHURNED |
22300 | 2018-05-06 | STARTED |
22300 | 2018-05-06 | STARTED |
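The dataframe is sorted by CLIENT_ID and CURRENT_DATE_STATUS. How can I create an indicator column (Boolean 1 or 0) that flags the rows where a CLIENT_ID's STATUS has changed from its previous entry to CHURNED or RESTARTED? The resulting dataframe should look like this: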
CLIENT_ID | CURRENT_DATE_STATUS | STATUS | STOPPED |
---|---|---|---|
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2017-07-21 | STARTED | 0 |
10002 | 2018-07-01 | CHURNED | 1 |
10002 | 2018-07-01 | CHURNED | 0 |
10002 | 2019-01-01 | RESTARTED | 1 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-08-15 | STARTED | 0 |
11811 | 2019-12-31 | RESTARTED | 1 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-03-11 | STARTED | 0 |
22101 | 2020-11-01 | CHURNED | 1 |
22300 | 2018-05-06 | STARTED | 0 |
22300 | 2018-05-06 | STARTED | 0 |
Here is the code to generate the dataframe:
import pandas as pd
data = {'CLIENT_ID':[10002,10002,10002,10002,10002,11811,11811,11811,22101,22101,22101,22101,22300,22300],
'CURRENT_DATE_STATUS':['2017-07-21','2017-07-21','2018-07-01','2018-07-01','2019-07-01','2019-08-15','2019-08-15','2019-12-31','2020-03-11','2020-03-11','2020-03-11','2020-11-01','2018-05-06','2018-05-06'],
'STATUS':['STARTED','STARTED','CHURNED','CHURNED','RESTARTED','STARTED','STARTED','RESTARTED','STARTED','STARTED','STARTED','CHURNED','STARTED','STARTED']}
df = pd.DataFrame(data)
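You can compare the current values for equality with Series.eq, and compare the values shifted within each group (DataFrameGroupBy.shift) for inequality with Series.ne. Chain each pair of conditions with & for bitwise AND, combine the two resulting masks with | for bitwise OR, and finally convert the result to integers: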
s = df.groupby('CLIENT_ID')['STATUS'].shift()
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')
df['STOPPED'] = (m1 | m2).astype(int)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
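To see why the masks above behave as they do, it can help to lay the intermediate columns side by side. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `PREV`, `M1`, `M2` column names are illustrative assumptions, not part of the question). It also shows one edge case worth knowing about: the first row of each group has no previous entry, so the shifted value is missing, `ne` evaluates to True there, and a client whose very first record is CHURNED or RESTARTED is flagged with 1.

```python
import pandas as pd

# Hypothetical toy data: client 2 starts directly in CHURNED.
toy = pd.DataFrame({
    'CLIENT_ID': [1, 1, 1, 2, 2],
    'STATUS': ['STARTED', 'CHURNED', 'CHURNED', 'CHURNED', 'RESTARTED'],
})

# Previous STATUS within each client; missing for the first row of each group.
prev = toy.groupby('CLIENT_ID')['STATUS'].shift()

# Current row is RESTARTED/CHURNED and the previous row was not the same status.
m1 = toy['STATUS'].eq('RESTARTED') & prev.ne('RESTARTED')
m2 = toy['STATUS'].eq('CHURNED') & prev.ne('CHURNED')

# Keep the intermediate steps as columns so they can be inspected together.
steps = toy.assign(PREV=prev, M1=m1, M2=m2, STOPPED=(m1 | m2).astype(int))
print(steps)
```

If a first record that is already CHURNED or RESTARTED should not count as a change, one option would be to additionally require `prev.notna()` in the final mask.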
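Another solution is to compare each value with the previous (shifted) value of its group for inequality, test whether the current value is in the list with Series.isin, and chain both masks with & for bitwise AND: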
m3 = df.groupby('CLIENT_ID')['STATUS'].shift().ne(df['STATUS'])
m4 = df['STATUS'].isin(["CHURNED", "RESTARTED"])
df['STOPPED'] = (m3 & m4).astype(int)
print (df)
CLIENT_ID CURRENT_DATE_STATUS STATUS STOPPED
0 10002 2017-07-21 STARTED 0
1 10002 2017-07-21 STARTED 0
2 10002 2018-07-01 CHURNED 1
3 10002 2018-07-01 CHURNED 0
4 10002 2019-07-01 RESTARTED 1
5 11811 2019-08-15 STARTED 0
6 11811 2019-08-15 STARTED 0
7 11811 2019-12-31 RESTARTED 1
8 22101 2020-03-11 STARTED 0
9 22101 2020-03-11 STARTED 0
10 22101 2020-03-11 STARTED 0
11 22101 2020-11-01 CHURNED 1
12 22300 2018-05-06 STARTED 0
13 22300 2018-05-06 STARTED 0
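Both solutions rely on the rows already being sorted by CLIENT_ID and CURRENT_DATE_STATUS, as stated in the question. As a rough sketch, the second approach could be wrapped in a small helper that sorts first. The name `add_stopped_flag` and its `targets` parameter are hypothetical, chosen only for illustration, and the date strings are assumed to be in ISO `YYYY-MM-DD` form so that sorting them as text matches chronological order.

```python
import pandas as pd

def add_stopped_flag(frame, targets=('CHURNED', 'RESTARTED')):
    """Hypothetical helper: sort by client and date, then flag changes into `targets`."""
    out = (frame.sort_values(['CLIENT_ID', 'CURRENT_DATE_STATUS'])
                .reset_index(drop=True))
    # True where STATUS differs from the previous row of the same client
    # (also True on the first row of each client, where the shift is missing).
    changed = out.groupby('CLIENT_ID')['STATUS'].shift().ne(out['STATUS'])
    out['STOPPED'] = (changed & out['STATUS'].isin(list(targets))).astype(int)
    return out

# Same data as in the question, fed in reverse order to exercise the sort.
data = {'CLIENT_ID': [10002, 10002, 10002, 10002, 10002, 11811, 11811, 11811,
                      22101, 22101, 22101, 22101, 22300, 22300],
        'CURRENT_DATE_STATUS': ['2017-07-21', '2017-07-21', '2018-07-01', '2018-07-01',
                                '2019-07-01', '2019-08-15', '2019-08-15', '2019-12-31',
                                '2020-03-11', '2020-03-11', '2020-03-11', '2020-11-01',
                                '2018-05-06', '2018-05-06'],
        'STATUS': ['STARTED', 'STARTED', 'CHURNED', 'CHURNED', 'RESTARTED',
                   'STARTED', 'STARTED', 'RESTARTED', 'STARTED', 'STARTED',
                   'STARTED', 'CHURNED', 'STARTED', 'STARTED']}
df = pd.DataFrame(data)

result = add_stopped_flag(df.iloc[::-1])
print(result)
```

Note that if a client has rows with different statuses on the same date, the order of those tied rows affects which one gets flagged, so a further tie-breaking sort key may be needed in that case.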