Creating dummy variable depending on year and category in pandas
This is my first time posting a question here, so please let me know if my question is lacking in any way.
Let's say I have the following dataframe, where "Value" contains only the integers 1 or 2. I want to create a column ("Desired") holding a dummy variable that is 1 as long as the firm has stayed at Value=1 since its first appearance. Once the firm has Value=2, the dummy variable should be 0 from then on, even if the firm later reverts to Value=1.
Firm_ID | Year | Value | Desired |
---|---|---|---|
0000001 | 2000 | 1 | 1 |
0000001 | 2001 | 1 | 1 |
0000001 | 2002 | 2 | 0 |
0000001 | 2003 | 2 | 0 |
0000001 | 2004 | 1 | 0 |
0000001 | 2005 | 1 | 0 |
0000002 | 2000 | 2 | 0 |
0000002 | 2001 | 2 | 0 |
0000002 | 2002 | 2 | 0 |
0000003 | 2000 | 1 | 1 |
0000003 | 2001 | 1 | 1 |
0000003 | 2002 | 1 | 1 |
0000003 | 2003 | 1 | 1 |
import pandas as pd

d = {'firm_id': ["0000001", "0000001", "0000001", "0000001", "0000001", "0000001",
                 "0000002", "0000002", "0000002",
                 "0000003", "0000003", "0000003", "0000003"],
     'year': [2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2000, 2001, 2002, 2003],
     'Value': [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1]}
df = pd.DataFrame(data=d)
Currently, the code I am running is the following.
for i in range(df.shape[0]):
    firm = df.loc[i,'firm_id']
    year = df.loc[i,'year']
    temp_df = df[df['firm_id']==firm]
    if (temp_df.groupby(['year']).max()[['Value']].max() == 2).iloc[0]: # At some point this firm becomes Value==2
        # Get Earliest Year of becoming Value==2
        year_df = temp_df.groupby(['year']).max()[['Value']]
        ch_year = year_df[year_df['Value']==2].index.min() # Year the firm becomes Value==2
        if year >= ch_year :
            df.loc[i,'Desired'] = 0
        else :
            df.loc[i,'Desired'] = 1
    else :# They always remain Value==1
        df.loc[i,'Desired']=1
However, this code takes too long on a dataframe the size of mine. Is there a more efficient way to write it?
# df = df.sort_values(["firm_id", "year"])
df.Value.ne(1).groupby(df.firm_id).cumsum().lt(1).astype(int)
# Value
# 0 1
# 1 1
# 2 0
# 3 0
# 4 0
# 5 0
# 6 0
# 7 0
# 8 0
# 9 1
# 10 1
# 11 1
# 12 1
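For what it's worth, the one-liner above can be unpacked step by step. This is only a minimal sketch, assuming the sample data from the question and that rows should be in year order within each firm; the names `left_one`, `times_left` and `alt` are my own and not part of the original answer. The `cummax` variant additionally assumes that "Value" really only ever holds 1 or 2, as the question states.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "firm_id": ["0000001"] * 6 + ["0000002"] * 3 + ["0000003"] * 4,
    "year": [2000, 2001, 2002, 2003, 2004, 2005,
             2000, 2001, 2002,
             2000, 2001, 2002, 2003],
    "Value": [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1],
})

# The running logic depends on chronological order within each firm.
df = df.sort_values(["firm_id", "year"]).reset_index(drop=True)

# True on every row where the firm is not at Value == 1.
left_one = df["Value"].ne(1)

# Running count of such rows per firm; 0 means the firm has not left
# Value == 1 at any point up to and including this row.
times_left = left_one.groupby(df["firm_id"]).cumsum()

df["Desired"] = times_left.eq(0).astype(int)

# Alternative under the "only 1 or 2" assumption: the running maximum
# per firm stays at 1 until the first 2 shows up.
alt = df.groupby("firm_id")["Value"].cummax().eq(1).astype(int)

print(df)
```

Both versions avoid the row-by-row loop and the repeated filtering of the dataframe per row, which is where most of the time in the original code is likely spent.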