Pandas get_dummies generates multiple columns for the same feature

I have a pandas Series that I'm trying to one-hot encode. I used the describe method to check how many unique categories the series has. The output is:

    input['pattern'].describe(include='all')

    count     9725
    unique       7
    top          1
    freq      4580
    Name: pattern, dtype: object

When I'm trying:

    x = pd.get_dummies(input['pattern'])
    x.describe(include='all')

I get 18 classes, 12 of which are all zeros. Why did get_dummies produce classes that do not occur even once in the input?

From a discussion in the comments, it was deduced that your column contained a mixture of strings and integers.

For example,

import pandas as pd

s = pd.Series(['0', 0, '0', '6', 6, '6', '3', '3'])
s

0    0
1    0
2    0
3    6
4    6
5    6
6    3
7    3
dtype: object

Now, calling pd.get_dummies produces multiple columns for the same feature, because the string '0' and the integer 0 (and likewise '6' and 6) are treated as distinct categories.

pd.get_dummies(s)

   0  6  0  3  6
0  0  0  1  0  0
1  1  0  0  0  0
2  0  0  1  0  0
3  0  0  0  0  1
4  0  1  0  0  0
5  0  0  0  0  1
6  0  0  0  1  0
7  0  0  0  1  0
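
The duplicated headers look identical in the printed frame, but the labels themselves differ in type. A small sketch for inspecting them (the names `dummies` and `label_types` are illustrative, not part of the original answer):

```python
import pandas as pd

s = pd.Series(['0', 0, '0', '6', 6, '6', '3', '3'])
dummies = pd.get_dummies(s)

# Pair each column label with the name of its Python type, so that
# the integer 0 and the string '0' can be told apart.
label_types = [(label, type(label).__name__) for label in dummies.columns]
print(label_types)
```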

The fix is to ensure that all elements are of the same type. For this case, I'd recommend converting to str.

s.astype(str).str.get_dummies()


   0  3  6
0  1  0  0
1  1  0  0
2  1  0  0
3  0  0  1
4  0  0  1
5  0  0  1
6  0  1  0
7  0  1  0
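
The same normalization should also work with the top-level pd.get_dummies function used in the question. The sketch below shows two variants. The second one, converting to integers with pd.to_numeric, is an alternative not mentioned in the answer above and assumes every value in the column parses as a number.

```python
import pandas as pd

s = pd.Series(['0', 0, '0', '6', 6, '6', '3', '3'])

# Option 1: normalize everything to strings, then encode.
as_str = pd.get_dummies(s.astype(str))

# Option 2 (assumes every value is numeric-like): normalize to integers.
as_int = pd.get_dummies(pd.to_numeric(s))

print(as_str.columns.tolist())
print(as_int.columns.tolist())
```

Either way, each distinct value ends up with a single column. Which type to choose depends on whether the categories are meant to be labels or numbers.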
