
How can I apply a function to create a dummy variable?

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

data={'state':[1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
      'year':[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
      'pop':[11, 22, 0, 33, 44, 32, 45, 66, 34, 12, 32, 0],
      'gdp':[123, 341, 554, 654, 245, 665, 332 ,321, 344, 232, 542, 221]}
frame=pd.DataFrame(data)

def treat(group):
        if group.ix[group.year==3, 'pop']!=0:  
            group['Treated']=1
        else:
            group['Treated']=0    

frame.groupby('state').apply(treat)

I am trying to create a variable frame['Treated'] based on a condition: if 'year' == 3 and 'pop' != 0, I consider the state to be in the treated group (hence the variable name 'Treated').

Unfortunately I end up with an error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What's wrong with my code? Do you know how I could solve this problem?

Re-edit: Thank you for your kind answers, and I'm sorry for not describing my problem clearly.

Let me describe the problem again. For state 1, pop is 0 in year 3, so state 1 is not in the treated group (as shown below, frame['Treated'] = 0 for state 1 in every year). For state 2, pop is not 0 in year 3, so state 2 is in the treated group (frame['Treated'] = 1 for state 2 in every year). The other states are handled the same way. The final result should look like this:

    state  year  pop  gdp  Treated
0       1     1   11  123        0
1       1     2   22  341        0
2       1     3    0  554        0
3       2     1   33  654        1
4       2     2   44  245        1
5       2     3   32  665        1
6       3     1   45  332        1
7       3     2   66  321        1
8       3     3   34  344        1
9       4     1   12  232        0
10      4     2   32  542        0
11      4     3    0  221        0

groupby is not needed here; np.where is enough. Note that the column has to be selected as frame['pop'], because frame.pop is the DataFrame.pop method rather than the column.

frame['Treated']=np.where((frame.year==3)&(frame['pop']!=0),1,0)
frame
Out[429]: 
    gdp  pop  state  year  Treated
0   123   11      1     1        0
1   341   22      1     2        0
2   554    0      1     3        0
3   654   33      2     1        0
4   245   44      2     2        0
5   665   32      2     3        1
6   332   45      3     1        0
7   321   66      3     2        0
8   344   34      3     3        1
9   232   12      4     1        0
10  542   32      4     2        0
11  221    0      4     3        0

An alternative to np.where would be to convert the appropriate boolean mask to integer type.

frame['Treated'] = (frame.year.eq(3) & frame['pop'].ne(0)).astype(int)

Your current code does not work because

group.ix[group.year==3, 'pop']!=0

returns a pandas Series rather than a single boolean, and a Series can't be used directly in an if statement. (Note also that .ix was removed in pandas 1.0; use .loc instead.) In any case, using apply like this is bad form when a boolean mask solves the problem.
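To make the failure concrete, here is a minimal sketch of what happens inside one group. The small frame named group is made up for illustration (it mirrors state 1 from the question), and .loc stands in for the removed .ix indexer.

```python
import pandas as pd

# Hypothetical single group, mirroring state 1 from the question.
group = pd.DataFrame({'year': [1, 2, 3], 'pop': [11, 22, 0]})

# Boolean indexing returns a Series, even when only one row matches.
mask = group.loc[group['year'] == 3, 'pop'] != 0

try:
    if mask:  # pandas refuses to guess a single truth value here
        pass
except ValueError as err:
    message = str(err)

# Reducing explicitly gives a plain bool that is safe in an if statement.
is_treated = bool(mask.any())
```

Whether .any() or .all() is the right reduction depends on what should happen if a group had several year-3 rows; with exactly one such row per state, as in the question's data, the two agree.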

Using pandas.DataFrame.assign and pandas.DataFrame.eval

frame.assign(Treated=frame.eval('pop != 0 & year == 3') * 1)

    gdp  pop  state  year  Treated
0   123   11      1     1        0
1   341   22      1     2        0
2   554    0      1     3        0
3   654   33      2     1        0
4   245   44      2     2        0
5   665   32      2     3        1
6   332   45      3     1        0
7   321   66      3     2        0
8   344   34      3     3        1
9   232   12      4     1        0
10  542   32      4     2        0
11  221    0      4     3        0

Multiplying by one forces the boolean result to an integer. The code is shorter, but it is not as efficient as @miradulo's astype(int).
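The row-wise masks in these answers flag only the year-3 rows, whereas the table in the question's edit flags every row of a treated state. One way to get that per-state result, offered as a sketch rather than the accepted method, is to compute the row-wise mask first and then broadcast it within each state using groupby(...).transform('any'). The name row_hit is illustrative.

```python
import pandas as pd

data = {'state': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        'year':  [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
        'pop':   [11, 22, 0, 33, 44, 32, 45, 66, 34, 12, 32, 0],
        'gdp':   [123, 341, 554, 654, 245, 665, 332, 321, 344, 232, 542, 221]}
frame = pd.DataFrame(data)

# Row-level condition: year 3 with a non-zero population.
row_hit = frame['year'].eq(3) & frame['pop'].ne(0)

# A state counts as treated if any of its rows satisfies the condition;
# transform('any') repeats that per-state answer on every row of the state.
frame['Treated'] = row_hit.groupby(frame['state']).transform('any').astype(int)
```

With the question's data this marks states 2 and 3 as treated in all three years and leaves states 1 and 4 at 0, which matches the table the asker posted.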
