
Pythonic way of making a dummy column from the sum of two values

I have a dataframe with a column called label that takes the values [0,1,2,3,4,5,6,8,9]. I would like to make dummy columns out of it, but with some labels joined together: for example, I want dummy_012 to be 1 if the observation has label 0, 1 or 2.

If I use the command df2 = pd.get_dummies(df, columns=['label'], prefix='dummy'), it creates 9 columns (dummy_0, dummy_1, ...), one for each label.

I know I can then use df2['dummy_012']=df2['dummy_0']+df2['dummy_1']+df2['dummy_2'] to combine them into one joint column, but I want to know if there's a more pythonic way of doing it (or some function where I can just change the parameters to define the joins).

Maybe this approach can give you an idea:

groups = ['012', '345', '6789']
for gp in groups:
    df.loc[df['Label'].isin([int(x) for x in gp]), 'Label_Group'] = f'dummies_{gp}'

Output:

   Label   Label_Group
0      0   dummies_012
1      1   dummies_012
2      2   dummies_012
3      3   dummies_345
4      4   dummies_345
5      5   dummies_345
6      6  dummies_6789
7      8  dummies_6789
8      9  dummies_6789

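And then apply get_dummies:

# dtype=int gives the 0/1 output shown here; recent pandas versions return booleans by default
df_dummies = pd.get_dummies(df['Label_Group'], dtype=int)

Output:

   dummies_012  dummies_345  dummies_6789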
0            1            0             0
1            1            0             0
2            1            0             0
3            0            1             0
4            0            1             0
5            0            1             0
6            0            0             1
7            0            0             1
8            0            0             1
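
The loop above can also be replaced by a single lookup. This is a minimal sketch, assuming the same Label column and groups list; group_of is a name made up here for illustration, and dtype=int is only there to get 0/1 instead of booleans on recent pandas versions:

```python
import pandas as pd

# Same labels as in the question (note that 7 never occurs)
df = pd.DataFrame({'Label': [0, 1, 2, 3, 4, 5, 6, 8, 9]})

groups = ['012', '345', '6789']

# Hypothetical helper dict: label -> group name, e.g. 0 -> 'dummies_012'
group_of = {int(ch): f'dummies_{gp}' for gp in groups for ch in gp}

# One vectorized lookup instead of one .loc assignment per group
df['Label_Group'] = df['Label'].map(group_of)

df_dummies = pd.get_dummies(df['Label_Group'], dtype=int)
print(df_dummies)
```

Changing the joins then only means editing the groups list. Any label missing from every group would end up as NaN in Label_Group and get all zeros in the dummy columns.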

I don't know whether this is pythonic, since a more elegant solution might exist, but it does let you change the parameters, and it's vectorized. I've read that get_dummies() can be a bit slow with large amounts of data, and vectorizing pandas operations is good practice in general. So I wrote this function to do its calculations on numpy arrays. It should perform better than similar non-vectorized functions as the dataset grows.

This function takes your dataframe and a list of numbers as strings, and returns your dataframe with the column you wanted added. It expects the individual dummy columns to already be in the dataframe.

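def get_dummy(df, column_nos):
    # column_nos: labels as strings, e.g. ['0', '1', '2']
    new_col_name = 'dummy_' + ''.join(column_nos)
    # Sum the existing per-label dummy columns (dummy_0, dummy_1, ...) as numpy arrays
    vector_sum = sum(df['dummy_' + i].values for i in column_nos)
    # 1 where at least one of the joined dummies is set, else 0
    df[new_col_name] = (vector_sum > 0).astype(int)

    return df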

If you'd rather pass integers than strings, you can tweak the function as follows.

def get_dummy(df, column_nos):
    # column_nos: labels as integers, e.g. [0, 1, 2]
    column_names = ['dummy_' + str(i) for i in column_nos]
    new_col_name = 'dummy_' + ''.join(str(i) for i in sorted(column_nos))

    # Sum the per-label dummy columns as numpy arrays, then flag rows where any is set
    vector_sum = sum(df[name].values for name in column_names)
    df[new_col_name] = (vector_sum > 0).astype(int)

    return df
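
For comparison, if the per-label dummy columns aren't needed for anything else, the joined columns can be built straight from the label column with isin, without calling get_dummies at all. This is a rough sketch rather than a drop-in replacement: the joins dict and its column names are assumptions for illustration, and the data mirrors the labels from the question.

```python
import pandas as pd

# Labels from the question (7 does not occur)
df = pd.DataFrame({'label': [0, 1, 2, 3, 4, 5, 6, 8, 9]})

# Hypothetical parameter: new column name -> labels merged into it
joins = {
    'dummy_012': [0, 1, 2],
    'dummy_345': [3, 4, 5],
    'dummy_689': [6, 8, 9],
}

for name, labels in joins.items():
    # isin returns a boolean Series; astype(int) turns it into 0/1
    df[name] = df['label'].isin(labels).astype(int)

print(df)
```

Changing which labels get joined is then just a matter of editing the dict.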
