How to get Top 3 or Top N predictions using sklearn's SGDClassifier

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import linear_model

arr = ['dogs cats lions', 'apple pineapple orange', 'water fire earth air', 'sodium potassium calcium']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(arr)
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
Y = ['animals', 'fruits', 'elements', 'chemicals']
T = ["eating apple roasted in fire and enjoying fresh air"]
test = vectorizer.transform(T)
clf = linear_model.SGDClassifier(loss='log_loss')  # spelled loss='log' before scikit-learn 1.1
clf.fit(X, Y)
x = clf.predict(test)
# predict() returns only the single best class, here: elements

In the above code, clf.predict() returns only the single best prediction for a sample. I am interested in the top 3 predictions for a particular sample. I know that predict_proba / predict_log_proba return the probabilities of every class in Y for each sample, but the result has to be sorted and then matched back to the class labels before the top 3 can be read off. Is there a direct and efficient way?

There is no built-in function, but what is wrong with

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[-n:]

?

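As suggested in one of the comments, [-n:] should be changed to [:,-n:]. predict_proba returns a 2-D array (one row per sample), so [-n:] selects the last n rows, whereas [:,-n:] selects the last n columns of every row: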

probs = clf.predict_proba(test)
best_n = np.argsort(probs, axis=1)[:,-n:]
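
To make the difference between the two slices concrete, here is a minimal sketch on a made-up probability matrix (the values are hypothetical and do not come from the classifier above). It also shows that the selected columns are still in ascending order, so they need reversing if the best class should come first.

```python
import numpy as np

# Hypothetical probabilities for 2 samples over 4 classes (made-up values).
probs = np.array([[0.10, 0.50, 0.15, 0.25],
                  [0.40, 0.05, 0.35, 0.20]])
n = 3

order = np.argsort(probs, axis=1)   # per-row class indices, ascending probability

rows_sliced = order[-n:]            # slices rows: shape stays (2, 4), nothing is dropped
cols_sliced = order[:, -n:]         # slices columns: shape (2, 3), top n per sample

# The kept columns are still ascending; reverse them to put the best class first.
best_first = cols_sliced[:, ::-1]

print(rows_sliced.shape, cols_sliced.shape)
print(best_first)
```

The integers in best_first are column indices into probs, not class names yet.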

I know this has been answered, but I can add a bit more:

# preds and truths have the same shape, m by n
# (m is the number of predictions, n is the number of classes; truths is one-hot)
def top_n_accuracy(preds, truths, n):
    best_n = np.argsort(preds, axis=1)[:, -n:]
    ts = np.argmax(truths, axis=1)
    successes = 0
    for i in range(ts.shape[0]):
        if ts[i] in best_n[i, :]:
            successes += 1
    return float(successes) / ts.shape[0]

It's quick and dirty, but I find it useful. One can add their own error checking, etc.
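
When the true labels are available as integer class indices rather than one-hot rows, the same idea can be written without the Python loop. This is only a sketch: the function name top_n_accuracy_int and the sample values are made up for illustration.

```python
import numpy as np

def top_n_accuracy_int(preds, y_true, n):
    """Fraction of rows whose true class index is among the n highest scores.

    preds:  (m, n_classes) array of scores or probabilities
    y_true: (m,) array of integer class indices (an assumption of this sketch)
    """
    preds = np.asarray(preds)
    y_true = np.asarray(y_true)
    best_n = np.argsort(preds, axis=1)[:, -n:]          # top n column indices per row
    hits = (best_n == y_true[:, None]).any(axis=1)      # is the true index among them?
    return float(hits.mean())

# Made-up scores for 3 samples over 3 classes
preds = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.2, 0.3, 0.5]])
y_true = np.array([2, 1, 2])

print(top_n_accuracy_int(preds, y_true, 2))
```

Recent scikit-learn releases (0.24 and later) also provide sklearn.metrics.top_k_accuracy_score, which may be preferable to a hand-rolled version.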

Hopefully, Andreas will help with this. predict_proba is not available when loss='hinge'. To get the top n classes when loss='hinge', do:

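from sklearn.calibration import CalibratedClassifierCV

# clfSDG: the poster's SGDClassifier with loss='hinge' (not defined in this snippet)
calibrated_clf = CalibratedClassifierCV(clfSDG, cv=3, method='sigmoid')
model = calibrated_clf.fit(train.data, train.label)

probs = model.predict_proba(test_data)
# (class, probability) pairs for the first sample; the last n are the top n, in ascending order
sorted(zip(calibrated_clf.classes_, probs[0]), key=lambda x: x[1])[-n:]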

Not sure if clfSDG.predict and calibrated_clf.predict will always predict the same class.
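
If only the ranking of classes is needed and not calibrated probabilities, one alternative is to rank by decision_function, which SGDClassifier provides for every loss. The sketch below reuses the toy documents from the question; random_state=0 and the variable names are assumptions added for reproducibility. The scores are signed distances to each one-vs-rest hyperplane, not probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

docs = ['dogs cats lions', 'apple pineapple orange',
        'water fire earth air', 'sodium potassium calcium']
labels = ['animals', 'fruits', 'elements', 'chemicals']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = SGDClassifier(loss='hinge', random_state=0).fit(X, labels)

test = vectorizer.transform(['eating apple roasted in fire and enjoying fresh air'])
scores = clf.decision_function(test)   # shape (n_samples, n_classes) for multiclass

n = 3
# A stable sort keeps tie-breaking consistent with argmax (first index wins).
top_idx = np.argsort(-scores, axis=1, kind='stable')[:, :n]
top_labels = clf.classes_[top_idx]     # labels ranked by score, best first

print(top_labels)
```

Because predict picks the class with the highest decision score in the multiclass case, the first column of top_labels agrees with clf.predict(test) by construction, which sidesteps the concern about the calibrated and uncalibrated models disagreeing.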

argsort returns indices in ascending order. If you want to avoid extra loops or confusion about reversing the result, you can use a simple trick:

probs = clf.predict_proba(test)
best_n = np.argsort(-probs, axis=1)[:, :n]

Negating the probabilities reverses the sort order, so the largest probability comes first and the first n columns are the top-n results in descending order.
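
The indices still have to be mapped back to class names. A minimal sketch of that step follows; top_n_labels is a hypothetical helper, and the classes and probs arrays are made-up stand-ins for clf.classes_ and clf.predict_proba(test). Note that the columns of predict_proba follow clf.classes_ (sorted labels), not the order of the original list Y.

```python
import numpy as np

def top_n_labels(probs, classes, n):
    """Return, per sample, a list of (label, probability) pairs, best first."""
    probs = np.asarray(probs)
    classes = np.asarray(classes)
    idx = np.argsort(-probs, axis=1)[:, :n]          # descending order via negation
    top_p = np.take_along_axis(probs, idx, axis=1)   # probabilities at those indices
    return [list(zip(classes[r].tolist(), p.tolist())) for r, p in zip(idx, top_p)]

# Made-up stand-ins for clf.classes_ and clf.predict_proba(test)
classes = np.array(['animals', 'chemicals', 'elements', 'fruits'])
probs = np.array([[0.05, 0.10, 0.60, 0.25]])

result = top_n_labels(probs, classes, 3)
print(result)
```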

As @FredFoo described in "How do I get indices of N maximum values in a NumPy array?", a faster method is to use argpartition:

Newer NumPy versions (1.8 and up) have a function called argpartition for this. To get the indices of the four largest elements, do

>>> a = np.array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> a
array([9, 4, 4, 3, 3, 9, 0, 4, 6, 0])
>>> ind = np.argpartition(a, -4)[-4:]
>>> ind
array([1, 5, 8, 0])
>>> a[ind]
array([4, 9, 6, 9])

Unlike argsort, this function runs in linear time in the worst case, but the returned indices are not sorted, as can be seen from the result of evaluating a[ind]. If you need that too, sort them afterwards:

>>> ind[np.argsort(a[ind])]
array([1, 8, 5, 0])

To get the top-k elements in sorted order in this way takes O(n + k log k) time.
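
The quoted example is one-dimensional, while predict_proba returns one row per sample. Applied row-wise, the same approach might look like the sketch below; top_n_indices is a hypothetical name, the input values are made up, and np.take_along_axis requires NumPy 1.15 or later.

```python
import numpy as np

def top_n_indices(probs, n):
    """Per-row indices of the n largest values, best first.

    Assumes 1 <= n <= number of columns.
    """
    probs = np.asarray(probs)
    part = np.argpartition(probs, -n, axis=1)[:, -n:]     # top n per row, unordered
    part_vals = np.take_along_axis(probs, part, axis=1)   # their values
    order = np.argsort(-part_vals, axis=1)                # sort only those n values
    return np.take_along_axis(part, order, axis=1)

# Made-up probabilities for 2 samples over 4 classes
probs = np.array([[0.10, 0.50, 0.15, 0.25],
                  [0.40, 0.05, 0.35, 0.20]])

top = top_n_indices(probs, 2)
print(top)
```

For a handful of classes the speed difference from a full argsort is negligible; the partition-based version is mainly worthwhile when the number of classes is large relative to n.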

I wrote a function that outputs a dataframe with the top n predictions and their probabilities, and ties it back to class names. Hope this is helpful!

import numpy as np
import pandas as pd

def return_top_n_pred_prob_df(n, model, X_test, column_name):
    predictions = model.predict_proba(X_test)
    # Class indices per row, sorted by descending probability
    preds_idx = np.argsort(-predictions, axis=1)
    classes = pd.DataFrame(model.classes_, columns=['class_name'])
    classes.reset_index(inplace=True)
    top_n_preds = pd.DataFrame()
    # Use the prediction count rather than len(X_test), which fails for sparse matrices
    rows = range(predictions.shape[0])
    for i in range(n):
        num_col = column_name + '_prediction_{}_num'.format(i)
        top_n_preds[num_col] = [preds_idx[doc][i] for doc in rows]
        top_n_preds[column_name + '_prediction_{}_probability'.format(i)] = [predictions[doc][preds_idx[doc][i]] for doc in rows]
        # Look up the class name for each index, then drop the helper columns
        top_n_preds = top_n_preds.merge(classes, how='left', left_on=num_col, right_on='index')
        top_n_preds = top_n_preds.rename(columns={'class_name': column_name + '_prediction_{}'.format(i)})
        top_n_preds.drop(columns=['index', num_col], inplace=True)
    return top_n_preds

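Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite this site's URL or the address of the original post. For any questions, contact: yoyou2525@163.com.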
