
Save one-hot-encoded features into Pandas DataFrame the fastest way

I have a Pandas DataFrame with all my features and labels. One of my features is categorical and needs to be one-hot encoded.

The feature is an integer and can only take values from 0 to 4.

To save those arrays back into my DataFrame, I use the following code:

# enc is my OneHotEncoder object
df['mycol'] = df['mycol'].map(lambda x: enc.transform(x).toarray())

My DataFrame has more than 1 million rows, so the above code takes a while. Is there a faster way to assign the arrays to the DataFrame cells? Because I have just 5 categories, I shouldn't need to call the transform() function 1 million times.

I already tried something like:

num_categories = 5
i = 0
while (i<num_categories):
    df.loc[df['mycol'] == i, 'mycol'] = enc.transform(i).toarray()
    i += 1

which yields this error:

ValueError: Must have equal len keys and value when setting with an ndarray

You can use pd.get_dummies:

>>> s
0    a
1    b
2    c
3    a
dtype: object

>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
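Applied to the question's setup, the same call can encode the whole column in one vectorized step instead of once per row. The following is a minimal sketch, not the asker's actual data: the small DataFrame and its "label" column are made up for illustration, and converting to a categorical dtype first is an assumption added so that all five columns appear even when a value never occurs. Passing dtype=int keeps the 0/1 output shown above, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; 'mycol' holds integers 0..4.
# The value 2 is deliberately absent to show that its column is still created.
df = pd.DataFrame({"mycol": [0, 3, 1, 4, 3, 0], "label": [1, 0, 0, 1, 1, 0]})

# Declare the full set of categories up front so get_dummies emits
# one column per category, observed or not.
cat_dtype = pd.CategoricalDtype(categories=list(range(5)))
dummies = pd.get_dummies(df["mycol"].astype(cat_dtype), prefix="mycol", dtype=int)

# Replace the original column with the five indicator columns.
df = pd.concat([df.drop(columns="mycol"), dummies], axis=1)
print(df)
```

Each category ends up in its own column (mycol_0 to mycol_4) rather than as an array stored inside a single cell, which is usually what downstream models expect.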

Alternatively, with scikit-learn's OneHotEncoder:

>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder

>>> enc = OneHotEncoder()
>>> a = np.array([1, 1, 3, 2, 2]).reshape(-1, 1)
>>> a
array([[1],
       [1],
       [3],
       [2],
       [2]])

>>> one_hot = enc.fit_transform(a)
>>> one_hot.toarray()
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.]])
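To get this result back into the DataFrame from the question, the encoder can be called once on the whole column and the resulting matrix wrapped in a new DataFrame. This is only a sketch under assumptions: the sample data and the mycol_<value> column names are invented for illustration, and the categories are listed explicitly so that five columns come out even if a value is missing.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real DataFrame; the value 2 never occurs.
df = pd.DataFrame({"mycol": [0, 3, 1, 4, 3, 0]})

# Listing the categories explicitly fixes the output width at five columns.
enc = OneHotEncoder(categories=[[0, 1, 2, 3, 4]])

# One call for the entire column; df[["mycol"]] is 2-D, as the encoder requires.
encoded = enc.fit_transform(df[["mycol"]]).toarray()

# Build one named column per category and align on the original index.
columns = [f"mycol_{c}" for c in enc.categories_[0]]
one_hot_df = pd.DataFrame(encoded, columns=columns, index=df.index)
df = df.join(one_hot_df)
print(df)
```

With over a million rows, the dense array from toarray() holds five floats per row; if memory is a concern, the sparse matrix returned by fit_transform can be kept as is and passed to estimators that accept sparse input.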


 