How to make pandas get_dummies act like DictVectorizer

Consider the dataframe df which equals:

  apple  carrot pear
0     3       1     
1             3    2
2     4       1    3

I can one-hot encode this using sklearn's DictVectorizer as follows:

from sklearn.feature_extraction import DictVectorizer
enc = DictVectorizer(sparse = False)
enc.fit_transform(df.T.to_dict().values())

This gives:

array([[ 3.,  0.,  1.,  0.,  1.],
       [ 0.,  1.,  3.,  2.,  0.],
       [ 4.,  0.,  1.,  3.,  0.]])

We can see the feature names of the columns with:

enc.feature_names_
['apple', 'apple=', 'carrot', 'pear', 'pear=']

So the second column, for example, indicates whether the apple column held ''.

If we try to do the same thing with get_dummies we get:

pd.get_dummies(df)
   carrot  apple_3  apple_4  apple_  pear_2  pear_3  pear_
0       1        1        0       0       0       0      1
1       3        0        0       1       1       0      0
2       1        0        1       0       0       1      0

This seems to have created a categorical variable for each value in the apple and pear columns, presumably because those columns now have a non-numerical type. This is not what I wanted. In my real data there will be lots of different numerical values and the only non-numerical value is '', so this would create a huge number of unnecessary extra columns.

Is it possible to make get_dummies give the same output as sklearn's DictVectorizer?

More generally, as my dataframe will be very large, is there any way to go directly to what DictVectorizer produces without first converting the dataframe to a list of dictionaries?

I can't get pandas.get_dummies() to work like this, and I don't think it's set up to be able to only create categorical variables for certain values.

I made this Gist that gives the output you want. It applies a function that replaces null values with 1. and non-null values with 0. You can then merge this new DataFrame with the original one to get the result you want.
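The Gist itself is not reproduced here, so the following is only a minimal sketch of the idea described above, not the Gist's actual code. It assumes the blank cells are empty strings, and the helper name encode_blanks is made up for illustration.

```python
import pandas as pd

# Example frame from the question; blank cells are assumed to be empty strings.
df = pd.DataFrame({
    "apple": [3, "", 4],
    "carrot": [1, 3, 1],
    "pear": ["", 2, 3],
})


def encode_blanks(frame, blank=""):
    """Hypothetical helper: numeric columns plus a 0/1 flag column
    for every column that contains `blank` at least once."""
    # Cast to object first so the comparison is done element by element.
    is_blank = frame.astype(object).eq(blank)
    # One flag column per affected column, named like DictVectorizer's 'apple='.
    flags = is_blank.loc[:, is_blank.any()].astype(float).add_suffix("=")
    # Blanks become 0 in the numeric part; everything else is cast to float.
    numeric = frame.mask(is_blank, 0).astype(float)
    out = numeric.join(flags)
    # Sort the columns by name, which is the order DictVectorizer uses.
    return out[sorted(out.columns)]


encoded = encode_blanks(df)
print(encoded)
```

On the example frame this yields the columns apple, apple=, carrot, pear, pear= with the same values as the DictVectorizer array in the question. Only one flag column is created per affected column, no matter how many distinct numbers the column holds.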

I don't think get_dummies can do that.

However, this answer passes the dataframe directly to DictVectorizer, which avoids the conversion to dict.

The following (by pratapvardhan) works:

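# dfn is True where a cell is non-numeric (alternatively: ~df.applymap(np.isreal))
dfn = df.apply(pd.to_numeric, errors='coerce').isnull()
# zero out the non-numeric cells, then append their dummy columns (names containing '_')
df.mask(dfn, 0).join(pd.get_dummies(df.where(dfn)).filter(like='_'))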

It would be very interesting to compare the speed of this solution.
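One way to make that comparison is sketched below. The synthetic data (integers with about 10% blanks in apple and pear), the row count and the function names are all assumptions chosen for illustration, and no timings are claimed here. scikit-learn is treated as optional, so only the pandas route is timed if it is not installed.

```python
import timeit

import numpy as np
import pandas as pd

try:  # scikit-learn is optional here; without it only the pandas route is timed
    from sklearn.feature_extraction import DictVectorizer
except ImportError:
    DictVectorizer = None


def make_frame(n_rows, seed=0):
    """Synthetic test data (an assumption): integers, ~10% blanks in apple and pear."""
    rng = np.random.default_rng(seed)
    data = {}
    for name, has_blanks in [("apple", True), ("carrot", False), ("pear", True)]:
        values = rng.integers(1, 100, size=n_rows)
        if has_blanks:
            values = values.astype(object)
            values[rng.random(n_rows) < 0.1] = ""
        data[name] = values
    return pd.DataFrame(data)


def pandas_route(df):
    # The two-line pandas solution quoted above.
    dfn = df.apply(pd.to_numeric, errors="coerce").isnull()
    return df.mask(dfn, 0).join(pd.get_dummies(df.where(dfn)).filter(like="_"))


def dictvectorizer_route(df):
    # The approach from the question, using one dict per row.
    return DictVectorizer(sparse=False).fit_transform(df.to_dict("records"))


frame = make_frame(20_000)
encoded = pandas_route(frame)

repeats = 3
timings = {"pandas": timeit.timeit(lambda: pandas_route(frame), number=repeats)}
if DictVectorizer is not None:
    timings["DictVectorizer"] = timeit.timeit(
        lambda: dictvectorizer_route(frame), number=repeats
    )

for name, seconds in timings.items():
    print(f"{name}: {seconds / repeats:.4f} s per run")
```

The results will depend on the number of rows, the share of blank cells and the library versions, so it is worth rerunning this with data shaped like the real dataframe.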
