
Creating dummy variable depending on year and category in pandas

This is my first time posting a question here, so please let me know if my question is lacking in any way.

Let's say I have the following dataframe, where "Value" contains only the integers 1 or 2. I want to create a dummy-variable column ("Desired") that is 1 for as long as the firm has kept Value=1 continuously since its first appearance. Once the firm has Value=2, the dummy variable should be 0 from then on, even if the firm later reverts to Value=1.

Firm_ID  Year  Value  Desired
0000001  2000  1      1
0000001  2001  1      1
0000001  2002  2      0
0000001  2003  2      0
0000001  2004  1      0
0000001  2005  1      0
0000002  2000  2      0
0000002  2001  2      0
0000002  2002  2      0
0000003  2000  1      1
0000003  2001  1      1
0000003  2002  1      1
0000003  2003  1      1

import pandas as pd

d = {
    'firm_id': ["0000001", "0000001", "0000001", "0000001", "0000001", "0000001",
                "0000002", "0000002", "0000002", "0000003", "0000003", "0000003", "0000003"],
    'year': [2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2000, 2001, 2002, 2003],
    'Value': [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1],
}
df = pd.DataFrame(data=d)

Currently, the code I am running is the following.

for i in range(df.shape[0]):
    firm = df.loc[i, 'firm_id']
    year = df.loc[i, 'year']
    temp_df = df[df['firm_id'] == firm]

    # At some point this firm becomes Value==2
    if (temp_df.groupby(['year']).max()[['Value']].max() == 2).iloc[0]:

        # Get the earliest year in which the firm has Value==2
        year_df = temp_df.groupby(['year']).max()[['Value']]
        ch_year = year_df[year_df['Value'] == 2].index.min()

        if year >= ch_year:
            df.loc[i, 'Desired'] = 0
        else:
            df.loc[i, 'Desired'] = 1
    else:  # The firm always remains Value==1
        df.loc[i, 'Desired'] = 1

However, this code takes too long on a dataframe of my current size. Is there a more efficient approach I could use?

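# If the rows are not already ordered by firm and year, sort first:
# df = df.sort_values(["firm_id", "year"])
# Flag rows where Value != 1, count those flags cumulatively within each firm,
# and mark a row 1 only while that running count is still 0.
df.Value.ne(1).groupby(df.firm_id).cumsum().lt(1).astype(int)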
#     Value
# 0    1
# 1    1
# 2    0
# 3    0
# 4    0
# 5    0
# 6    0
# 7    0
# 8    0
# 9    1
# 10   1
# 11   1
# 12   1
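
To make the one-liner above easier to follow, here is a minimal sketch that splits it into named steps and stores the result in a "Desired" column. The intermediate names (`not_one`, `breaks_so_far`, `alt`) are mine, not from the original post, and the sketch assumes, as the question states, that "Value" only ever holds 1 or 2. The last lines show an equivalent variant using a per-firm running maximum instead of a running sum.

```python
import pandas as pd

df = pd.DataFrame({
    "firm_id": ["0000001"] * 6 + ["0000002"] * 3 + ["0000003"] * 4,
    "year": [2000, 2001, 2002, 2003, 2004, 2005,
             2000, 2001, 2002,
             2000, 2001, 2002, 2003],
    "Value": [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1],
})

# The running count depends on row order, so sort by firm and year first.
df = df.sort_values(["firm_id", "year"]).reset_index(drop=True)

# Step 1: True wherever the firm does not have Value == 1.
not_one = df["Value"].ne(1)

# Step 2: running count, within each firm, of such rows seen so far.
breaks_so_far = not_one.groupby(df["firm_id"]).cumsum()

# Step 3: the dummy is 1 only while no such row has been seen yet.
df["Desired"] = breaks_so_far.eq(0).astype(int)

# Variant (assumes Value is only 1 or 2): a per-firm running maximum of
# the 0/1 flag "Value == 2" becomes 1 at the first 2 and stays there.
alt = 1 - df["Value"].eq(2).astype(int).groupby(df["firm_id"]).cummax()

print(df)
```

Both versions avoid the Python-level loop over rows, so each firm is scanned once by pandas rather than being re-filtered for every row.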
