Create a dummy variable based on year and category in pandas

Question

This is my first time posting a question here, so please let me know if my question is lacking in any way.

Let's say I have the following dataframe, where "Value" contains only the integers 1 or 2. I want to create a column ("Desired") holding a dummy variable that is 1 as long as the firm has kept Value=1 since its first appearance. Once the firm has Value=2, the dummy variable should be 0, even if the firm later reverts to Value=1.

Firm_ID	Year	Value	Desired
0000001	2000	1	1
0000001	2001	1	1
0000001	2002	2	0
0000001	2003	2	0
0000001	2004	1	0
0000001	2005	1	0
0000002	2000	2	0
0000002	2001	2	0
0000002	2002	2	0
0000003	2000	1	1
0000003	2001	1	1
0000003	2002	1	1
0000003	2003	1	1

import pandas as pd

d = {'firm_id': ["0000001", "0000001", "0000001", "0000001", "0000001", "0000001",
                 "0000002", "0000002", "0000002",
                 "0000003", "0000003", "0000003", "0000003"],
     'year': [2000, 2001, 2002, 2003, 2004, 2005, 2000, 2001, 2002, 2000, 2001, 2002, 2003],
     'Value': [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1]}
df = pd.DataFrame(data=d)

Currently, the code I am running is the following.

for i in range(df.shape[0]):
    firm = df.loc[i,'firm_id']
    year = df.loc[i,'year']
    temp_df = df[df['firm_id']==firm]
    
    if (temp_df.groupby(['year']).max()[['Value']].max() == 2).iloc[0]: # At some point this firm becomes Value==2
        
        # Get Earliest Year of becoming Value==2
        year_df = temp_df.groupby(['year']).max()[['Value']]
        ch_year = year_df[year_df['Value']==2].index.min() # Year the firm becomes Value==2
        
        if year >= ch_year :
            df.loc[i,'Desired'] = 0
        else : 
            df.loc[i,'Desired'] = 1
    else :# They always remain Value==1
         df.loc[i,'Desired']=1

However, this code takes too long on a dataframe of my size. Is there a more efficient approach I could use?

Answer 1

# df = df.sort_values(["firm_id", "year"])
df.Value.ne(1).groupby(df.firm_id).cumsum().lt(1).astype(int)
#     Value
# 0    1
# 1    1
# 2    0
# 3    0
# 4    0
# 5    0
# 6    0
# 7    0
# 8    0
# 9    1
# 10   1
# 11   1
# 12   1
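
The one-liner above can be read as three steps: flag rows where Value is not 1, keep a running count of those flags within each firm, and mark a row 1 only while that count is still zero. The sketch below spells this out on the sample data from the question. The intermediate names (not_one, seen_so_far) are introduced here for illustration only, and the sort is an assumption for data that may not already be ordered by firm and year.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "firm_id": ["0000001"] * 6 + ["0000002"] * 3 + ["0000003"] * 4,
    "year": [2000, 2001, 2002, 2003, 2004, 2005,
             2000, 2001, 2002,
             2000, 2001, 2002, 2003],
    "Value": [1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1],
})

# The running count follows row order, so sort by firm and year first
df = df.sort_values(["firm_id", "year"]).reset_index(drop=True)

# Step 1: True wherever Value is anything other than 1
not_one = df["Value"].ne(1)

# Step 2: per firm, how many non-1 rows have been seen up to and including this row
seen_so_far = not_one.groupby(df["firm_id"]).cumsum()

# Step 3: 1 while no non-1 row has been seen yet, 0 from the first one onward
df["Desired"] = seen_so_far.lt(1).astype(int)

print(df)
```

Because the count never decreases, a firm that switches back to Value=1 still stays at 0, which matches the rule in the question.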

Solution 1
0 2021-11-21 05:34:44