
Specify shape for categorical feature columns?

I know that I can use a categorical_column_with_identity to turn a categorical feature into a series of one-hot features.

For instance, if my vocabulary is ["ON", "OFF", "UNKNOWN"] :
"OFF" -> [0, 1, 0]

categorical_column = tf.feature_column.categorical_column_with_identity('column_name', num_buckets=3)
feature_column = tf.feature_column.indicator_column(categorical_column)

However, I actually have a 1-dimensional array of categorical features. I would like to turn that into a 2-dimensional array of one-hot features:

["OFF", "ON", "OFF", "UNKNOWN", "ON"]
->
[[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

Unlike every other feature column, it doesn't seem like there's a shape attribute on categorical_column_with_identity and I didn't find any help through Google or the docs.

Do I have to give up on categorical_column_with_identity and create the 2D array myself through a numerical_column ?

As per the comments, I'm not sure this is possible with TensorFlow feature columns. But with pandas you have a trivial solution via pd.get_dummies:

import pandas as pd

L = ['OFF', 'ON', 'OFF', 'UNKNOWN', 'ON']

res = pd.get_dummies(L)

print(res)

   OFF  ON  UNKNOWN
0    1   0        0
1    0   1        0
2    1   0        0
3    0   0        1
4    0   1        0
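Note that pd.get_dummies orders the columns alphabetically (OFF, ON, UNKNOWN), which is not the ["ON", "OFF", "UNKNOWN"] order from the question. A minimal sketch of one way to pin the column order, assuming the vocabulary is known up front: wrap the values in a pd.Categorical with explicit categories. The name `vocabulary` is introduced here for illustration only, and dtype=int is passed because recent pandas versions return booleans by default.

```python
import pandas as pd

# Assumed vocabulary order, taken from the question.
vocabulary = ["ON", "OFF", "UNKNOWN"]
L = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]

# An explicit category list fixes both the set and the order of the columns.
cat = pd.Categorical(L, categories=vocabulary)

# dtype=int gives 0/1 instead of the boolean default of newer pandas.
res = pd.get_dummies(cat, dtype=int)

print(res)
```

With this, "OFF" maps to [0, 1, 0] as in the question, and a category that never appears in the data still gets its own all-zero column.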

For performance, or if you need only a NumPy array, you can use LabelBinarizer from sklearn.preprocessing :

from sklearn.preprocessing import LabelBinarizer

LB = LabelBinarizer()

res = LB.fit_transform(L)

print(res)

[[1 0 0]
 [0 1 0]
 [1 0 0]
 [0 0 1]
 [0 1 0]]
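LabelBinarizer also sorts its classes, so the columns here are again OFF, ON, UNKNOWN. If only a NumPy array is needed and the vocabulary is fixed, the same result can be sketched without scikit-learn by indexing into an identity matrix. This is an illustrative sketch rather than part of the original answer; `vocabulary`, `index` and `one_hot` are hypothetical names.

```python
import numpy as np

# Assumed vocabulary; the column order follows this list.
vocabulary = ["ON", "OFF", "UNKNOWN"]
index = {label: i for i, label in enumerate(vocabulary)}

def one_hot(labels, vocabulary_index=index):
    """Return a (len(labels), len(vocabulary)) array of 0/1 integers."""
    ids = np.array([vocabulary_index[label] for label in labels])
    # Row i of the identity matrix is the one-hot vector for id i.
    return np.eye(len(vocabulary_index), dtype=int)[ids]

features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
encoded = one_hot(features)
print(encoded)
```

A label that is not in the vocabulary raises a KeyError here, which may or may not be the behavior you want.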

A couple of options for one-hot encoding:

import tensorflow as tf
test = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
encoding = {x:idx for idx, x in enumerate(sorted(set(test)))}
test = [encoding[x] for x in test]
print(tf.keras.utils.to_categorical(test, num_classes=len(encoding)))

>>>[[1. 0. 0.]
    [0. 1. 0.]
    [1. 0. 0.]
    [0. 0. 1.]
    [0. 1. 0.]]

Or with scikit-learn, as the other answer stated:

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
transformed_label = encoder.fit_transform(["OFF", "ON", "OFF", "UNKNOWN", "ON"])
print(transformed_label)

>>>[[1 0 0]
    [0 1 0]
    [1 0 0]
    [0 0 1]
    [0 1 0]]

You can use a dict as a map like this:

categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
one_hot_features = []

# Named one_hot_map rather than map to avoid shadowing the built-in map().
one_hot_map = {"ON": [1, 0, 0], "OFF": [0, 1, 0], "UNKNOWN": [0, 0, 1]}

for val in categorical_features:
    one_hot_features.append(one_hot_map[val])

or with a list comprehension:

categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]

one_hot_map = {"ON": [1, 0, 0], "OFF": [0, 1, 0], "UNKNOWN": [0, 0, 1]}
one_hot_features = [one_hot_map[f] for f in categorical_features]

This should give you what you want.
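Writing the one-hot vectors by hand gets error-prone as the vocabulary grows. As a rough sketch, the lookup dict can instead be generated from the vocabulary list; `build_one_hot_map` is a hypothetical helper name, not something from a library.

```python
def build_one_hot_map(vocabulary):
    """Map each label to a one-hot list whose 1 sits at the label's position."""
    size = len(vocabulary)
    return {
        label: [1 if i == pos else 0 for i in range(size)]
        for pos, label in enumerate(vocabulary)
    }

one_hot_map = build_one_hot_map(["ON", "OFF", "UNKNOWN"])

categorical_features = ["OFF", "ON", "OFF", "UNKNOWN", "ON"]
one_hot_features = [one_hot_map[f] for f in categorical_features]
print(one_hot_features)
```

The generated dict is the same as the hand-written one, so the rest of the code does not change.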
