
How to recover input categorical symbols from scikit-learn predict_proba?

I have a list of words, all of the same length; the length is stored in a variable LENGTH and the list is called wordlist. I want to predict the first letter of each word given all the others. However, I am stuck at the final stage, where I need to associate the probabilities returned by predict_proba with the original letters.

This is my work so far:

My target vector is categorical and so are my features; in particular, they take values from the ASCII alphabet. To perform classification I first transform everything into integers. To do this I build two dictionaries: one that maps letters to integers (alphabet_map) and one that maps integers back to letters (inv_alphabet_map).

alphabet = set()
for word in wordlist:
    alphabet = alphabet.union(set(word))
alphabet_size = len(alphabet)
alphabet_map = dict()
alphabet_stored = sorted(alphabet)
inv_alphabet_map = dict()
for i in range(alphabet_size):
    inv_alphabet_map[i] = alphabet_stored[i]
    alphabet_map[alphabet_stored[i]] = i
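
For illustration (with a hypothetical two-word list, not my real data), the two maps come out like this:

wordlist = ["cab", "bad"]
# after running the loop above:
# alphabet_map      == {'a': 0, 'b': 1, 'c': 2, 'd': 3}
# inv_alphabet_map  == {0: 'a', 1: 'b', 2: 'c', 3: 'd'}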

I then make X and y as follows:

import numpy as np

def makeX_simple(words, alphabet_map):
    # one row per word, one integer-encoded letter per column
    X = [[alphabet_map[letter] for letter in word] for word in words]
    return np.array(X)

X = makeX_simple(wordlist, alphabet_map)
y = X[:, 0]                      # target: the first letter of each word
X = np.delete(X, 0, axis=1)      # features: the remaining letters

Now I can build the classifier:

from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(categorical_features=range(LENGTH - 1), verbose=2, learning_rate=0.02).fit(X, y)

I can then do the following as a test:

model.predict_proba(X[0].reshape(1, -1))[0]

But how can I tell which probabilities are associated with which letters from my original input? Does predict_proba sort the target variables and do I need to make another map from the index in the output array to the letter in my original input?
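
For example, I assume the columns of predict_proba follow model.classes_, so the sketch below (using the model and inv_alphabet_map from above) is roughly what I have in mind, but I am not sure this is correct:

probs = model.predict_proba(X[0].reshape(1, -1))[0]
# tentative: map each class code back to its letter via inv_alphabet_map
letter_probs = {inv_alphabet_map[cls]: p for cls, p in zip(model.classes_, probs)}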

One difficulty is that I need the mapping from letters to integers to be the same for X and y.

What is the right way to do all this?

To transform the target labels you can use LabelEncoder(); for the categorical features you can use OrdinalEncoder():

  1. Use OrdinalEncoder() to transform the features while training
  2. Use LabelEncoder() to transform the target values while training
  3. After prediction, get the target labels from the LabelEncoder() via .classes_ and zip each prediction-probability list with the list of target labels. This gives you something like:
[[('TARGET_A', 0.1), ('TARGET_B', 0.5), ('TARGET_C', 0.4)], [('TARGET_A', 0.5), ('TARGET_B', 0.3), ('TARGET_C', 0.2)]]
  4. Use dict() to convert each probability list from a list of tuples to a dict for easier access to values, giving:
[{'TARGET_A': 0.1, 'TARGET_B': 0.5, 'TARGET_C': 0.4}, {'TARGET_A': 0.5, 'TARGET_B': 0.3, 'TARGET_C': 0.2}]

Here is a sample:

Code to train/fit:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import joblib

# while training
X = # your_categorical_features
Y = # your_target_labels

ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X)
x = ordinal_encoder.transform(X)

label_encoder = LabelEncoder()
label_encoder.fit(Y)
y = label_encoder.transform(Y)

model = HistGradientBoostingClassifier(categorical_features=range(LENGTH - 1), verbose=2, learning_rate=0.02).fit(x, y)
# save the label_encoder object
joblib.dump(label_encoder, label_encoder_filename)
# save the ordinal_encoder object
joblib.dump(ordinal_encoder, ordinal_encoder_filename)

Code to predict:

# later while predicting
import joblib

# load label_encoder object
label_encoder = joblib.load(label_encoder_filename)
# load ordinal_encoder object
ordinal_encoder = joblib.load(ordinal_encoder_filename)

X = # your values to predict
x = ordinal_encoder.transform(X)

predictions = model.predict(x)
prediction_probas = model.predict_proba(x)
p_labels = label_encoder.inverse_transform(predictions)

proba_list = []
for idx in range(len(prediction_probas)):
    proba_list.append(dict(zip(label_encoder.classes_, prediction_probas[idx])))

# access a prediction probability by target label
target_A_probability = proba_list[0]['TARGET_A']

# To use predicted values to call predict_proba again, encode them using the ordinal_encoder
X = # your new values to predict
x = ordinal_encoder.transform(X)
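
Regarding your point about keeping the letter-to-integer mapping the same for X and y: one possible way (a sketch, assuming the alphabet, wordlist and LENGTH variables from your question) is to pass the same sorted alphabet as the categories of every OrdinalEncoder column; since LabelEncoder also sorts its classes, the integer codes then line up:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

letters = sorted(alphabet)                        # one fixed, sorted alphabet
raw = np.array([list(word) for word in wordlist]) # shape (n_words, LENGTH)

# force every feature column to use the same category order as the target
ordinal_encoder = OrdinalEncoder(categories=[letters] * (LENGTH - 1))
label_encoder = LabelEncoder().fit(letters)       # LabelEncoder sorts, so codes match

x = ordinal_encoder.fit_transform(raw[:, 1:])     # remaining letters -> features
y = label_encoder.transform(raw[:, 0])            # first letter -> target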
