I would like to hash the feature 'Genre' into 6 columns and, separately, the feature 'Publisher' into another 6 columns. I want something like the following:
         Genre  Publisher     0     1     2     3     4     5     0     1     2     3     4     5
0     Platform   Nintendo   0.0   2.0   2.0  -1.0   1.0   0.0   0.0   2.0   2.0  -1.0   1.0   0.0
1       Racing       Noir  -1.0   0.0   0.0   0.0   0.0  -1.0  -1.0   0.0   0.0   0.0   0.0  -1.0
2       Sports      Laura  -2.0   2.0   0.0  -2.0   0.0   0.0  -2.0   2.0   0.0  -2.0   0.0   0.0
3  Roleplaying       John  -2.0   2.0   2.0   0.0   1.0   0.0  -2.0   2.0   2.0   0.0   1.0   0.0
4       Puzzle       John   0.0   1.0   1.0  -2.0   1.0  -1.0   0.0   1.0   1.0  -2.0   1.0  -1.0
5     Platform       Noir   0.0   2.0   2.0  -1.0   1.0   0.0   0.0   2.0   2.0  -1.0   1.0   0.0
The following code does what I want:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

d = {'Genre': ['Platform', 'Racing', 'Sports', 'Roleplaying', 'Puzzle', 'Platform'],
     'Publisher': ['Nintendo', 'Noir', 'Laura', 'John', 'John', 'Noir']}
df = pd.DataFrame(data=d)

# One hasher per feature, each producing 6 columns.
# Each string is passed as one sample, so its characters are hashed as tokens;
# recent scikit-learn versions may reject a bare string as a sample.
fh1 = FeatureHasher(n_features=6, input_type='string')
fh2 = FeatureHasher(n_features=6, input_type='string')
hashed_features1 = fh1.fit_transform(df['Genre']).toarray()
hashed_features2 = fh2.fit_transform(df['Publisher']).toarray()

# Put the original columns and both hashed blocks side by side
pd.concat([df[['Genre', 'Publisher']],
           pd.DataFrame(hashed_features1),
           pd.DataFrame(hashed_features2)],
          axis=1)
This works for the two features above, but if I have, say, 40 categorical features, this approach becomes tedious. Is there another way to do it?
Hashing (Update)
Assuming that new categories might show up in some of the features, hashing is the way to go. Just 2 notes:
One Hot Vector
If the number of categories for each feature is fixed and not too large, use one-hot encoding.
I would recommend using either of the two:
sklearn.preprocessing.OneHotEncoder
pandas.get_dummies
Example
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                   'feature_2': ['cat', 'dog', 'elephant', 'zebra']})

# Approach 0 (Hashing per feature)
# A scalar column index passes each column to its hasher as a 1-D Series,
# so each string is one sample and its characters are hashed as tokens;
# recent scikit-learn versions may reject a bare string as a sample.
n_orig_features = df.shape[1]
hash_vector_size = 6
ct = ColumnTransformer([(f't_{i}',
                         FeatureHasher(n_features=hash_vector_size, input_type='string'),
                         i)
                        for i in range(n_orig_features)])
res_0 = ct.fit_transform(df)  # res_0.shape[1] = n_orig_features * hash_vector_size

# Approach 1 (OHV)
res_1 = pd.get_dummies(df)

# Approach 2 (OHV)
# Note: scikit-learn 1.2 renamed this parameter to sparse_output
res_2 = OneHotEncoder(sparse=False).fit_transform(df)
res_0:
array([[ 0., 0., 0., 0., 1., 0., 0., 0., 1., -1., 0., -1.],
[ 0., 0., 0., 1., 0., 0., 0., 2., -1., 0., 0., 0.],
[ 0., -1., 0., 0., 0., 0., -2., 2., 2., -1., 0., -1.],
[ 0., 0., 0., 0., 1., 0., 0., 2., 1., -1., 0., -1.]])
res_1:
   feature_1_A  feature_1_G  feature_1_T  feature_2_cat  feature_2_dog  feature_2_elephant  feature_2_zebra
0            1            0            0              1              0                   0                0
1            0            1            0              0              1                   0                0
2            0            0            1              0              0                   1                0
3            1            0            0              0              0                   0                1
res_2:
array([[1., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 0., 0., 0., 1.]])
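To illustrate the trade-off mentioned earlier (hashing when new categories may appear, one-hot encoding when the categories are fixed), here is a minimal sketch. The unseen value 'lion' and the choice to wrap each value in a list, so that the whole string is hashed as a single token, are assumptions made for this illustration and are not part of the example above.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'feature_2': ['cat', 'dog', 'elephant', 'zebra']})
new = pd.DataFrame({'feature_2': ['lion']})  # hypothetical category not seen during fit

# Hashing is stateless: any string lands in one of the n_features buckets,
# so an unseen category needs no special handling.
hasher = FeatureHasher(n_features=6, input_type='string')
# Wrap each value in a list so the whole string is hashed as one token.
hashed_new = hasher.transform([[v] for v in new['feature_2']]).toarray()

# One-hot encoding learns its categories during fit. With handle_unknown='ignore',
# an unseen category is encoded as an all-zero row instead of raising an error.
ohe = OneHotEncoder(handle_unknown='ignore').fit(train)
onehot_new = ohe.transform(new).toarray()

print(hashed_new.shape)   # (1, 6)
print(onehot_new.shape)   # (1, 4)
```

With hashing, the new category still gets a non-zero representation, at the cost of possible collisions with other categories; with one-hot encoding it carries no information at all.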
I am late to this question, but in the examples I have seen on Kaggle, feature hashing is performed on multiple columns at once (i.e. on a DataFrame) rather than on individual columns whose sparse matrices are then concatenated. See the notebooks on Kaggle, here and here. I have also tried both ways of performing feature hashing on this data, i.e.:
a. Hash individual categorical columns and concatenate the results
b. Hash all categorical columns of a DataFrame at once
A logistic regression classifier gave significantly better results with approach (b) than with approach (a).
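The two approaches can be sketched roughly as follows. This is only one possible way to write them, not necessarily how the linked notebooks do it: the small DataFrame, the use of input_type='dict' for approach (b), and the value n_features=12 are all assumptions made for illustration.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'Genre': ['Platform', 'Racing', 'Sports'],
                   'Publisher': ['Nintendo', 'Noir', 'Laura']})
cat_cols = list(df.columns)  # in practice, e.g. the 40 categorical columns

# (a) One hasher per column, results concatenated: 6 output columns per feature.
# Each value is wrapped in a list so the whole string is hashed as one token.
blocks = [
    FeatureHasher(n_features=6, input_type='string').transform([[v] for v in df[col]])
    for col in cat_cols
]
hashed_a = hstack(blocks).tocsr()

# (b) One hasher for the whole DataFrame: every row becomes a dict such as
# {'Genre': 'Platform', 'Publisher': 'Nintendo'}; string values are hashed as
# "column=value" tokens into a single shared space of n_features columns.
hasher_b = FeatureHasher(n_features=12, input_type='dict')  # 12 is an arbitrary choice
hashed_b = hasher_b.transform(df[cat_cols].to_dict(orient='records'))

print(hashed_a.shape)  # (3, 12)
print(hashed_b.shape)  # (3, 12)
```

In (a) the output width grows with the number of columns, while in (b) it is fixed by n_features and all columns share the same buckets, so n_features usually needs to be chosen larger to keep collisions rare.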