I have a list of words, all of the same length; the length is stored in a variable LENGTH and the list of words is called wordlist. I want to predict the first letter of each word given all the others. However, I am having trouble at the final stage, where I need to associate the probabilities returned by predict_proba with letters.
This is my work so far:
My target vector is categorical and so are my features. In particular they can take values from the ASCII alphabet. In order to perform classification I first transform everything into integers. To do this I make two dictionaries, one that maps from letters to integers (alphabet_map) and one that maps from integers to letters (inv_alphabet_map).
alphabet = set()
for word in wordlist:
    alphabet = alphabet.union(set(word))

alphabet_size = len(alphabet)
alphabet_stored = sorted(alphabet)
alphabet_map = dict()
inv_alphabet_map = dict()
for i in range(alphabet_size):
    inv_alphabet_map[i] = alphabet_stored[i]
    alphabet_map[alphabet_stored[i]] = i
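As a quick sanity check on the two dictionaries (using a toy wordlist invented for illustration), they should be exact inverses of each other:

```python
# toy wordlist, invented purely for illustration
wordlist = ["cat", "cot", "dog"]

alphabet = set()
for word in wordlist:
    alphabet = alphabet.union(set(word))

alphabet_stored = sorted(alphabet)
alphabet_map = {ch: i for i, ch in enumerate(alphabet_stored)}
inv_alphabet_map = {i: ch for i, ch in enumerate(alphabet_stored)}

print(alphabet_map)  # {'a': 0, 'c': 1, 'd': 2, 'g': 3, 'o': 4, 't': 5}

# round trip: letter -> integer -> letter recovers the original letter
assert all(inv_alphabet_map[alphabet_map[ch]] == ch for ch in alphabet)
```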
I then make X and y as follows:
import numpy as np

def makeX_simple(words, alphabet_map):
    X = []
    for idx in range(len(words)):
        X.append(list(words[idx]))
    X = [[alphabet_map[letter] for letter in sublist] for sublist in X]
    X = np.array(X)
    return X

X = makeX_simple(wordlist, alphabet_map)
y = X[:, 0]
X = np.delete(X, 0, axis=1)
Now I can build the classifier:
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    categorical_features=range(LENGTH - 1), verbose=2, learning_rate=0.02
).fit(X, y)
I can then do the following as a test:
model.predict_proba(X[0].reshape(1, -1))[0]
But how can I tell which probabilities are associated with which letters from my original input? Does predict_proba sort the target variables and do I need to make another map from the index in the output array to the letter in my original input?
One difficulty is that I need the mapping from letters to integers to be the same for X and y.
What is the right way to do all this?
To transform target labels you can use LabelEncoder(); to transform categorical features you can use OrdinalEncoder().

- Use OrdinalEncoder() to transform the features while training.
- Use LabelEncoder() to transform the target values while training.
- Use the fitted LabelEncoder's .classes_ attribute and zip each prediction-probability list with the list of target labels. This will give you something like:
  [[('TARGET_A', 0.1), ('TARGET_B', 0.5), ('TARGET_C', 0.4)], [('TARGET_A', 0.5), ('TARGET_B', 0.3), ('TARGET_C', 0.2)]]
- Use dict() to convert each probability list from a list of tuples to a dict, for easier access to values:
  [{'TARGET_A': 0.1, 'TARGET_B': 0.5, 'TARGET_C': 0.4}, {'TARGET_A': 0.5, 'TARGET_B': 0.3, 'TARGET_C': 0.2}]
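A minimal, self-contained sketch of that zip-and-dict step (the labels and probability rows here are invented for illustration): LabelEncoder stores the classes sorted in .classes_, and each row of predict_proba follows that same column order.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['TARGET_B', 'TARGET_A', 'TARGET_C', 'TARGET_A'])
print(label_encoder.classes_)  # ['TARGET_A' 'TARGET_B' 'TARGET_C'] (sorted)

# two hypothetical predict_proba rows, one per sample
prediction_probas = [[0.1, 0.5, 0.4], [0.5, 0.3, 0.2]]

# pair each probability with its class label, then convert to a dict
classes = label_encoder.classes_.tolist()
proba_list = [dict(zip(classes, row)) for row in prediction_probas]
print(proba_list[0])  # {'TARGET_A': 0.1, 'TARGET_B': 0.5, 'TARGET_C': 0.4}
```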
Here is a sample:
Code to train/fit:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import joblib

# while training
X = ...  # your categorical features
Y = ...  # your target labels

ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X)
x = ordinal_encoder.transform(X)

label_encoder = LabelEncoder()
label_encoder.fit(Y)
y = label_encoder.transform(Y)

model = HistGradientBoostingClassifier(
    categorical_features=range(LENGTH - 1), verbose=2, learning_rate=0.02
).fit(x, y)

# save the fitted encoders so the same mappings can be reused at prediction time
joblib.dump(label_encoder, label_encoder_filename)
joblib.dump(ordinal_encoder, ordinal_encoder_filename)
Code to predict:
# later, while predicting
import joblib

# load the saved encoders
label_encoder = joblib.load(label_encoder_filename)
ordinal_encoder = joblib.load(ordinal_encoder_filename)

X = ...  # your values to predict
x = ordinal_encoder.transform(X)

predictions = model.predict(x)
prediction_probas = model.predict_proba(x)
p_labels = label_encoder.inverse_transform(predictions)

proba_list = []
for idx in range(len(prediction_probas)):
    proba_list.append(dict(zip(label_encoder.classes_, prediction_probas[idx])))

# access a prediction probability by its label
target_A_probability = proba_list[0]['target_A']

# to call predict_proba on new values, encode them with the same ordinal_encoder
X = ...  # your new values to predict
x = ordinal_encoder.transform(X)