
Multiclass classification and probability prediction

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

fi = "df.csv"
# Read in the data; read_csv opens and closes the file itself
data = pd.read_csv(fi, sep=",")

# split the data into training and test data
train, test = train_test_split(data, test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()


# the first 127 columns are features, the 128th column is the class label
train_features = train.iloc[:, 0:127]
train_label = train.iloc[:, 127]

test_features = test.iloc[:, 0:127]
test_label = test.iloc[:, 127]

naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)

print("test_data\n", test_data["p_malw"])
print("Accuracy:", naive_b.score(test_features, test_label))

I have written this code to read a CSV file with 128 columns, where the first 127 columns are features and the 128th column is the class label.

I want to predict the probability that each sample belongs to each class (there are 5 classes, 1-5), print the probabilities in the form of a matrix, and determine the class of each sample based on the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.

GaussianNB.predict_proba returns, for each sample, the probability of every class in the model. In your case it should return an array with five columns and as many rows as your test data. You can check which column corresponds to which class using naive_b.classes_. So it is not clear why you say this is not the desired output. Perhaps your problem comes from assigning the output of predict_proba to a single data frame column. Try:

pred_prob = naive_b.predict_proba(test_features)

instead of

test_data["p_malw"] = naive_b.predict_proba(test_features)

and verify its shape using pred_prob.shape. The second dimension should be 5.
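To get the probabilities "in the form of a matrix" with labelled columns, one option is to wrap the array in its own data frame. The following is only a sketch under stated assumptions: the synthetic data stands in for df.csv (which is not available here), and the p_class_* column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for df.csv (an assumption): 127 features, labels 1-5.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 127
labels = np.arange(n_samples) % 5 + 1
# Shift each class's feature means so the classes are separable.
features = rng.normal(size=(n_samples, n_features)) + labels[:, None]

train_features, test_features = features[:300], features[300:]
train_label, test_label = labels[:300], labels[300:]

naive_b = GaussianNB().fit(train_features, train_label)

pred_prob = naive_b.predict_proba(test_features)
print(pred_prob.shape)     # (number of test samples, number of classes)
print(naive_b.classes_)    # column order of pred_prob

# One probability column per class; the column names are hypothetical.
prob_columns = ["p_class_%d" % c for c in naive_b.classes_]
prob_df = pd.DataFrame(pred_prob, columns=prob_columns)
print(prob_df.head())
```

When test_features is a data frame, as in the question, passing index=test_features.index to pd.DataFrame keeps the probability rows aligned with the test rows, so the result can be joined with pd.concat(..., axis=1).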

If you want the predicted label for each sample, you can use the predict method, followed by a confusion matrix to see how many labels have been predicted correctly.

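from sklearn.metrics import confusion_matrix

naive_b.fit(train_features, train_label)

# predicted class label for every test sample
pred_label = naive_b.predict(test_features)

# rows are true labels, columns are predicted labels
confusion_m = confusion_matrix(test_label, pred_label)
print(confusion_m)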
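The link between the probabilities and the predicted class can be checked directly: the label returned by predict should be the class whose column in predict_proba has the highest value, barring exact ties. A minimal sketch, again using synthetic data as a stand-in for df.csv (an assumption, not the asker's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for df.csv (an assumption): 127 features, labels 1-5.
rng = np.random.default_rng(0)
labels = np.arange(500) % 5 + 1
features = rng.normal(size=(500, 127)) + labels[:, None]
train_features, test_features = features[:300], features[300:]
train_label, test_label = labels[:300], labels[300:]

naive_b = GaussianNB().fit(train_features, train_label)

pred_prob = naive_b.predict_proba(test_features)
# Most probable column per row, mapped back to a label through classes_.
label_from_prob = naive_b.classes_[np.argmax(pred_prob, axis=1)]
pred_label = naive_b.predict(test_features)
print(np.array_equal(label_from_prob, pred_label))

# Fix the row/column order explicitly so it matches classes_.
confusion_m = confusion_matrix(test_label, pred_label, labels=naive_b.classes_)
# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = np.trace(confusion_m) / confusion_m.sum()
print(confusion_m)
print(accuracy, naive_b.score(test_features, test_label))
```

Passing labels=naive_b.classes_ is optional, but it guarantees a 5x5 matrix in a known order even if some class happens to be missing from the test split.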

Here is some useful reading.

sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba

sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
