
How to create an indicator column to indicate a specific change from the previous entry in a dataframe that is sorted and grouped by ID?

I have a dataframe of client data, keyed by CLIENT_ID, as shown below:

CLIENT_ID  CURRENT_DATE_STATUS  STATUS
10002      2017-07-21           STARTED
10002      2017-07-21           STARTED
10002      2018-07-01           CHURNED
10002      2018-07-01           CHURNED
10002      2019-01-01           RESTARTED
11811      2019-08-15           STARTED
11811      2019-08-15           STARTED
11811      2019-12-31           RESTARTED
22101      2020-03-11           STARTED
22101      2020-03-11           STARTED
22101      2020-03-11           STARTED
22101      2020-11-01           CHURNED
22300      2018-05-06           STARTED
22300      2018-05-06           STARTED

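The dataframe is sorted by CLIENT_ID and CURRENT_DATE_STATUS. How do I create an indicator column (Boolean 1 or 0) that flags:

  • whether, within each CLIENT_ID, the STATUS has changed from the previous entry to CHURNED or RESTARTED

The resulting dataframe would look like this: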

CLIENT_ID  CURRENT_DATE_STATUS  STATUS     STOPPED
10002      2017-07-21           STARTED    0
10002      2017-07-21           STARTED    0
10002      2018-07-01           CHURNED    1
10002      2018-07-01           CHURNED    0
10002      2019-01-01           RESTARTED  1
11811      2019-08-15           STARTED    0
11811      2019-08-15           STARTED    0
11811      2019-12-31           RESTARTED  1
22101      2020-03-11           STARTED    0
22101      2020-03-11           STARTED    0
22101      2020-03-11           STARTED    0
22101      2020-11-01           CHURNED    1
22300      2018-05-06           STARTED    0
22300      2018-05-06           STARTED    0

Here is the code to generate the dataframe:

import pandas as pd

data = {'CLIENT_ID':[10002,10002,10002,10002,10002,11811,11811,11811,22101,22101,22101,22101,22300,22300],
'CURRENT_DATE_STATUS':['2017-07-21','2017-07-21','2018-07-01','2018-07-01','2019-07-01','2019-08-15','2019-08-15','2019-12-31','2020-03-11','2020-03-11','2020-03-11','2020-11-01','2018-05-06','2018-05-06'],
'STATUS':['STARTED','STARTED','CHURNED','CHURNED','RESTARTED','STARTED','STARTED','RESTARTED','STARTED','STARTED','STARTED','CHURNED','STARTED','STARTED']}
df = pd.DataFrame(data)

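You can compare the current values for equality with Series.eq, compare the values shifted per group by DataFrameGroupBy.shift for inequality with Series.ne, chain each pair with & for bitwise AND, combine the two masks with | for bitwise OR, and finally convert the result to integers: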

s = df.groupby('CLIENT_ID')['STATUS'].shift()
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')

df['STOPPED'] = (m1 | m2).astype(int)
print (df)
    CLIENT_ID CURRENT_DATE_STATUS     STATUS  STOPPED
0       10002          2017-07-21    STARTED        0
1       10002          2017-07-21    STARTED        0
2       10002          2018-07-01    CHURNED        1
3       10002          2018-07-01    CHURNED        0
4       10002          2019-07-01  RESTARTED        1
5       11811          2019-08-15    STARTED        0
6       11811          2019-08-15    STARTED        0
7       11811          2019-12-31  RESTARTED        1
8       22101          2020-03-11    STARTED        0
9       22101          2020-03-11    STARTED        0
10      22101          2020-03-11    STARTED        0
11      22101          2020-11-01    CHURNED        1
12      22300          2018-05-06    STARTED        0
13      22300          2018-05-06    STARTED        0
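
To make the role of the shifted series clearer, here is a minimal sketch on a small made-up frame (the demo data and the names demo, prev and changed are assumptions for illustration, not part of the question). It shows that the first row of each group has no previous value, so Series.ne treats it as a change:

```python
import pandas as pd

# Hypothetical demo data, not the dataframe from the question
demo = pd.DataFrame({
    'CLIENT_ID': [1, 1, 1, 2, 2],
    'STATUS': ['STARTED', 'CHURNED', 'CHURNED', 'RESTARTED', 'RESTARTED'],
})

# Previous STATUS within each CLIENT_ID; the first row of every group is missing
prev = demo.groupby('CLIENT_ID')['STATUS'].shift()

# A missing previous value never equals a string, so ne() is True there
changed = demo['STATUS'].ne(prev)

print(demo.assign(PREV=prev, CHANGED=changed))
```

One consequence worth keeping in mind: a client whose very first row is already CHURNED or RESTARTED (client 2 in this sketch) would be flagged with 1. That case does not occur in the sample data, so whether it is the desired behavior depends on your use case.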

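Another solution is to compare the current values with the previous values shifted per group for inequality, test membership in a list with Series.isin, and chain both masks with & for bitwise AND: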

m3 = df.groupby('CLIENT_ID')['STATUS'].shift().ne(df['STATUS'])
m4 = df['STATUS'].isin(["CHURNED", "RESTARTED"])

df['STOPPED'] = (m3 & m4).astype(int)
print (df)

    CLIENT_ID CURRENT_DATE_STATUS     STATUS  STOPPED
0       10002          2017-07-21    STARTED        0
1       10002          2017-07-21    STARTED        0
2       10002          2018-07-01    CHURNED        1
3       10002          2018-07-01    CHURNED        0
4       10002          2019-07-01  RESTARTED        1
5       11811          2019-08-15    STARTED        0
6       11811          2019-08-15    STARTED        0
7       11811          2019-12-31  RESTARTED        1
8       22101          2020-03-11    STARTED        0
9       22101          2020-03-11    STARTED        0
10      22101          2020-03-11    STARTED        0
11      22101          2020-11-01    CHURNED        1
12      22300          2018-05-06    STARTED        0
13      22300          2018-05-06    STARTED        0
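
As a sanity check, the sketch below rebuilds the dataframe from the question, computes both variants and compares them with the STOPPED column the question asks for. The helper name stopped_flag and its statuses parameter are hypothetical, introduced only for this illustration:

```python
import pandas as pd

data = {'CLIENT_ID': [10002, 10002, 10002, 10002, 10002, 11811, 11811, 11811,
                      22101, 22101, 22101, 22101, 22300, 22300],
        'STATUS': ['STARTED', 'STARTED', 'CHURNED', 'CHURNED', 'RESTARTED',
                   'STARTED', 'STARTED', 'RESTARTED', 'STARTED', 'STARTED',
                   'STARTED', 'CHURNED', 'STARTED', 'STARTED']}
df = pd.DataFrame(data)

def stopped_flag(frame, statuses=('CHURNED', 'RESTARTED')):
    """Hypothetical helper: 1 where STATUS differs from the previous row
    of the same CLIENT_ID and the new STATUS is one of `statuses`."""
    prev = frame.groupby('CLIENT_ID')['STATUS'].shift()
    changed = frame['STATUS'].ne(prev)
    return (changed & frame['STATUS'].isin(list(statuses))).astype(int)

# First variant: one mask per status, combined with |
s = df.groupby('CLIENT_ID')['STATUS'].shift()
m1 = df['STATUS'].eq('RESTARTED') & s.ne('RESTARTED')
m2 = df['STATUS'].eq('CHURNED') & s.ne('CHURNED')
first = (m1 | m2).astype(int)

# Second variant, wrapped in the helper
second = stopped_flag(df)

# STOPPED column as given in the question
expected = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(first.tolist() == expected, second.tolist() == expected)
```

Both variants assume the rows are already sorted by CLIENT_ID and CURRENT_DATE_STATUS, as stated in the question; shift only looks at row order within each group.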

