简体   繁体   中英

Split data into columns in pandas

I have a df as

name category   dummy
USA   fx,ft,fe   1
INDIA fx         13

I need to convert this as

name category_fx categoty_ft category_fe  dummy 
USA  True         True        True         1
INDIA True        False       False        13

tried with series.explode() function but not getting this output.

You can use str.get_dummies and astype(bool) to convert your strings to new columns of booleans, then add_prefix to change the column names, and finally join :

df2 = (df.drop(columns='category)
         .join(df['category']
              .str.get_dummies(sep=',')
              .astype(bool)
              .add_prefix('category_')
              )
      )

or, for modification of the original dataframe:

df = df.join(df.pop('category')
               .str.get_dummies(sep=',')
               .astype(bool)
               .add_prefix('category_'))

output:

    name  category_fe  category_ft  category_fx
0    USA         True         True         True
1  INDIA        False        False         True
generalization to more columns

assuming this input:

    name category1 category2  dummy
0    USA  fx,ft,fe     a,b,c      1
1  INDIA        fx         d     13
cats = df.filter(like='category').columns
cols = list(df.columns.difference(cats))
(df
 .set_index(cols)
 .stack()
 .str.get_dummies(sep=',')
 .groupby(level=cols).max().astype(bool)
 .reset_index()
)

output:

   dummy   name      a      b      c      d     fe     ft    fx
0      1    USA   True   True   True  False   True   True  True
1     13  INDIA  False  False  False   True  False  False  True

Use Series.str.get_dummies by column category with converting 0,1 to boolean by DataFrame.astype and DataFrame.add_prefix :

c = df.columns.difference(['category'], sort=False).tolist()
df = (df.set_index(c)['category']
        .str.get_dummies(',')
        .astype(bool)
        .add_prefix('category_')
        .reset_index())
print (df)
    name  category_fe  category_ft  category_fx
0    USA         True         True         True
1  INDIA        False        False         True

EDIT: If need replace one column by multiple columns you can use:

df1 = (df['category']
        .str.get_dummies(',')
        .astype(bool)
        .add_prefix('category_'))

pos = df.columns.get_loc('category')
df = pd.concat([df.iloc[:, :pos], df1, df.iloc[:, pos+1:]], axis=1)
print (df)
    name  category_fe  category_ft  category_fx  dummy
0    USA         True         True         True      1
1  INDIA        False        False         True     13

This solution is modifid for multiple columns:

print (df)
    name  category  dummy category1
0    USA  fx,ft,fe      1       a,f
1  INDIA        fx     13       s,a

cols = ['category','category1']

dfs = [(df[c].str.get_dummies(',').astype(bool).add_prefix(f'{c}_')) for c in cols]

df = pd.concat([df, *dfs], axis=1).drop(cols, axis=1)
print (df)
    name  dummy  category_fe  category_ft  category_fx  category1_a  \
0    USA      1         True         True         True         True   
1  INDIA     13        False        False         True         True   

   category1_f  category1_s  
0         True        False  
1        False         True   

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM