
How to recover input categorical symbols from scikit-learn predict_proba?

I have a list of words, all of the same length; the length is stored in a variable LENGTH and the list is called wordlist. I want to predict the first letter of each word given all the others. However, I am stuck at the final stage, where I need to associate the probabilities returned by predict_proba with the original letters.

This is my work so far:

My target vector is categorical and so are my features; in particular, they take values from the ASCII alphabet. To perform classification I first transform everything into integers. To do this I build two dictionaries: one that maps letters to integers (alphabet_map) and one that maps integers back to letters (inv_alphabet_map).

alphabet = set()
for word in wordlist:
    alphabet = alphabet.union(set(word))
alphabet_size = len(alphabet)
alphabet_map = dict()
alphabet_stored = sorted(alphabet)
inv_alphabet_map = dict()
for i in range(alphabet_size):
    inv_alphabet_map[i] = alphabet_stored[i]
    alphabet_map[alphabet_stored[i]] = i
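
For illustration (with a hypothetical two-word list, not my real data), the two maps come out like this:

wordlist = ["cab", "bad"]
# after running the loop above:
# alphabet_map      == {'a': 0, 'b': 1, 'c': 2, 'd': 3}
# inv_alphabet_map  == {0: 'a', 1: 'b', 2: 'c', 3: 'd'}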

I then make X and y as follows:

import numpy as np

def makeX_simple(words, alphabet_map):
    # one row per word, one integer-encoded letter per column
    X = [[alphabet_map[letter] for letter in word] for word in words]
    return np.array(X)

X = makeX_simple(wordlist, alphabet_map)
y = X[:, 0]                      # target: the first letter of each word
X = np.delete(X, 0, axis=1)      # features: the remaining letters

Now I can build the classifier:

from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(categorical_features=range(LENGTH - 1), verbose=2, learning_rate=0.02).fit(X, y)

I can then do the following as a test:

model.predict_proba(X[0].reshape(1, -1))[0]

But how can I tell which probabilities are associated with which letters from my original input? Does predict_proba sort the target variables and do I need to make another map from the index in the output array to the letter in my original input?
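
For example, I assume the columns of predict_proba follow model.classes_, so the sketch below (using the model and inv_alphabet_map from above) is roughly what I have in mind, but I am not sure this is correct:

probs = model.predict_proba(X[0].reshape(1, -1))[0]
# tentative: map each class code back to its letter via inv_alphabet_map
letter_probs = {inv_alphabet_map[cls]: p for cls, p in zip(model.classes_, probs)}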

One difficulty is that I need the mapping from letters to integers to be the same for X and y.

What is the right way to do all this?

To transform the target labels you can use LabelEncoder(); for the categorical features you can use OrdinalEncoder():

  1. Use OrdinalEncoder() to transform the features while training
  2. Use LabelEncoder() to transform the target values while training
  3. After prediction, get the target labels from the LabelEncoder() via .classes_ and zip each prediction-probability list with the list of target labels. This gives you something like:
[[('TARGET_A', 0.1), ('TARGET_B', 0.5), ('TARGET_C', 0.4)], [('TARGET_A', 0.5), ('TARGET_B', 0.3), ('TARGET_C', 0.2)]]
  4. Use dict() to convert each probability list from a list of tuples to a dict for easier access to values, giving:
[{'TARGET_A': 0.1, 'TARGET_B': 0.5, 'TARGET_C': 0.4}, {'TARGET_A': 0.5, 'TARGET_B': 0.3, 'TARGET_C': 0.2}]

Here is a sample:

Code to train/fit:

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import joblib

# while training
X = # your_categorical_features
Y = # your_target_labels

ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X)
x = ordinal_encoder.transform(X)

label_encoder = LabelEncoder()
label_encoder.fit(Y)
y = label_encoder.transform(Y)

model = HistGradientBoostingClassifier(categorical_features=range(LENGTH - 1), verbose=2, learning_rate=0.02).fit(x, y)
# save the label_encoder object
joblib.dump(label_encoder, label_encoder_filename)
# save the ordinal_encoder object
joblib.dump(ordinal_encoder, ordinal_encoder_filename)

Code to predict:

# later while predicting
import joblib

# load label_encoder object
label_encoder = joblib.load(label_encoder_filename)
# load ordinal_encoder object
ordinal_encoder = joblib.load(ordinal_encoder_filename)

X = # your values to predict
x = ordinal_encoder.transform(X)

predictions = model.predict(x)
prediction_probas = model.predict_proba(x)
p_labels = label_encoder.inverse_transform(predictions)

proba_list = []
for idx in range(len(prediction_probas)):
    proba_list.append(dict(zip(label_encoder.classes_, prediction_probas[idx])))

# access a prediction probability by target label
target_A_probability = proba_list[0]['TARGET_A']

# To use predicted values to call predict_proba again, encode them using the ordinal_encoder
X = # your new values to predict
x = ordinal_encoder.transform(X)
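
Regarding your point about keeping the letter-to-integer mapping the same for X and y: one possible way (a sketch, assuming the alphabet, wordlist and LENGTH variables from your question) is to pass the same sorted alphabet as the categories of every OrdinalEncoder column; since LabelEncoder also sorts its classes, the integer codes then line up:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

letters = sorted(alphabet)                        # one fixed, sorted alphabet
raw = np.array([list(word) for word in wordlist]) # shape (n_words, LENGTH)

# force every feature column to use the same category order as the target
ordinal_encoder = OrdinalEncoder(categories=[letters] * (LENGTH - 1))
label_encoder = LabelEncoder().fit(letters)       # LabelEncoder sorts, so codes match

x = ordinal_encoder.fit_transform(raw[:, 1:])     # remaining letters -> features
y = label_encoder.transform(raw[:, 0])            # first letter -> target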
