MultiIndex DataFrame: How to create a new column based on values in another column?
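I have an unbalanced Pandas MultiIndex DataFrame in which each row stores a firm-year observation. The sample period (variable year) ranges from 2013 to 2017. The dataset contains the variable event, which is set to 1 if the event occurred in the given year.
Sample dataset: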
#Create dataset
import pandas as pd
df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
2016,2017,2013,2014,2015,2014,2015,2016,2017],
'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)
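I want to create a new column status based on the existing column event, as follows: once the event occurs for the first time in column event, the value of status should change from 0 to 1 for that year and all subsequent years.
DataFrame with the expected variable status: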
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
So far I have not found a working solution, so any advice would be much appreciated. Thanks!
We can group by the first level of the index (id) and flag all rows where event equals 1 with eq. Then use cumsum, which also converts True to 1 and False to 0:
# transform returns a result aligned with df's index, so it can be assigned directly
df['status'] = df.groupby(level=0)['event'].transform(lambda x: x.eq(1).cumsum())
Output:
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
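One caveat: cumsum counts events, so the result matches the expected status only while each id has at most one event, as in the sample data. If an event could occur more than once for the same id, a running maximum may be closer to what the question describes. The following is a minimal sketch under that assumption; the small dataset with a repeated event is hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical data: id 1 has the event in two different years.
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2],
                   'year': [2013, 2014, 2015, 2016, 2014, 2015, 2016],
                   'event': [0, 1, 0, 1, 0, 0, 1]})
df = df.set_index(['id', 'year']).sort_index()

# Running count of events within each id: can exceed 1.
event_count = df['event'].eq(1).astype(int).groupby(level='id').cumsum()

# Running maximum within each id: stays at 1 once the event has occurred.
df['status'] = df['event'].eq(1).astype(int).groupby(level='id').cummax()

print(event_count.tolist())   # [0, 1, 1, 2, 0, 0, 1]
print(df['status'].tolist())  # [0, 1, 1, 1, 0, 0, 1]
```

With exactly one event per id, cumsum and cummax give the same result.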
The key is to use cumsum within a groupby:
df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
2016,2017,2013,2014,2015,2014,2015,2016,2017],
'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
(df.assign(status = lambda x: x.event.eq(1).mul(1).groupby(x['id']).cumsum())
.set_index(['id','year']))
Output:
         event  status
id year
1  2013      1       1
   2014      0       1
   2015      0       1
   2016      0       1
   2017      0       1
2  2014      0       0
   2015      0       0
   2016      1       1
   2017      0       1
3  2016      1       1
   2017      0       1
4  2013      0       0
   2014      1       1
   2015      0       1
5  2014      0       0
   2015      0       0
   2016      0       0
   2017      1       1
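To see what each step of that chain contributes, the one-liner can be split into intermediate results. This is only an illustrative sketch: the five-row sample and the names is_event, as_int and status are made up for the example.

```python
import pandas as pd

# Small hypothetical sample, not the dataset from the question.
df = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                   'event': [1, 0, 0, 1, 0]})

is_event = df['event'].eq(1)                # boolean Series: True where event == 1
as_int = is_event.mul(1)                    # True/False become 1/0
status = as_int.groupby(df['id']).cumsum()  # running sum that restarts for each id

print(is_event.tolist())  # [True, False, False, True, False]
print(as_int.tolist())    # [1, 0, 0, 1, 0]
print(status.tolist())    # [1, 1, 0, 1, 1]
```

Grouping by the id column, rather than by an index level, is what lets this version run before set_index is called.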
A basic answer, with comments explaining each step:
import pandas as pd
df = pd.DataFrame({'id' : [1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,5],
'year' : [2013,2014,2015,2016,2017,2014,2015,2016,2017,
2016,2017,2013,2014,2015,2014,2015,2016,2017],
'event' : [1,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1]})
# extract the unique IDs in order of first appearance
ids = df["id"].unique()
# initialize a list to keep the results
list_event_years = []
# loop over the IDs
for firm_id in ids:
    # reset the flag to 0 for each ID
    event_happened = 0
    # loop over the rows of the DataFrame that belong to the current ID
    for index, row in df[df["id"] == firm_id].iterrows():
        # once the event has happened, set the flag to 1
        if row["event"] == 1:
            event_happened = 1
        # append the current flag to the list of results
        list_event_years.append(event_happened)
# add the list of results as a DataFrame column
# (this assumes the rows of each ID are contiguous, as in this sample)
df["event-year"] = list_event_years
### OUTPUT
>>> df
    id  year  event  event-year
0    1  2013      1           1
1    1  2014      0           1
2    1  2015      0           1
3    1  2016      0           1
4    1  2017      0           1
5    2  2014      0           0
6    2  2015      0           0
7    2  2016      1           1
8    2  2017      0           1
9    3  2016      1           1
10   3  2017      0           1
11   4  2013      0           0
12   4  2014      1           1
13   4  2015      0           1
14   5  2014      0           0
15   5  2015      0           0
16   5  2016      0           0
17   5  2017      1           1
If you need the result indexed as in your example, do the following:
df.set_index(['id', 'year'], inplace = True)
df.sort_index(inplace = True)
### OUTPUT
>>> df
         event  event-year
id year
1  2013      1           1
   2014      0           1
   2015      0           1
   2016      0           1
   2017      0           1
2  2014      0           0
   2015      0           0
   2016      1           1
   2017      0           1
3  2016      1           1
   2017      0           1
4  2013      0           0
   2014      1           1
   2015      0           1
5  2014      0           0
   2015      0           0
   2016      0           0
   2017      1           1
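The loop above appends its results ID by ID, so the list only lines up with the DataFrame when the rows of each ID sit next to each other. If that is not guaranteed, one possible variation is to write each result back by row label. The sketch below assumes a hypothetical sample with interleaved IDs; the names status, firm_id and row_label are illustrative.

```python
import pandas as pd

# Hypothetical sample in which the rows of the two ids are interleaved.
df = pd.DataFrame({'id': [1, 2, 1, 2, 1, 2],
                   'year': [2013, 2013, 2014, 2014, 2015, 2015],
                   'event': [0, 0, 1, 0, 0, 1]})

# One result slot per row, addressed by the original row label.
status = pd.Series(0, index=df.index)

# Walk through each id in chronological order.
for firm_id, group in df.sort_values('year').groupby('id'):
    event_happened = 0
    for row_label, event in group['event'].items():
        if event == 1:
            event_happened = 1
        # Store the flag under the row's own label, not by position in a list.
        status.loc[row_label] = event_happened

df['event-year'] = status
print(df['event-year'].tolist())  # [0, 0, 1, 0, 1, 1]
```

Because the values are stored by label, the row order of the input no longer affects which row receives which flag.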