
How to fix this classification report warning?

I created a model for multiclass classification. Everything went well and I got a validation accuracy of 84%, but when I printed the classification report I got this warning:

 UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Classification report:

              precision    recall  f1-score   support

           0       0.84      1.00      0.91     51890
           1       0.67      0.04      0.08      8706
           2       0.00      0.00      0.00      1605

    accuracy                           0.84     62201
   macro avg       0.50      0.35      0.33     62201
weighted avg       0.79      0.84      0.77     62201

Source code:

import pandas as pd

df = pd.read_csv('Crop_Agriculture_Data_2.csv')
df = df.drop('ID', axis=1)

# One-hot encode the categorical columns
dummies = pd.get_dummies(
    df[['Crop_Type', 'Soil_Type', 'Pesticide_Use_Category', 'Season']],
    drop_first=True)
df = df.drop(['Crop_Type', 'Soil_Type', 'Pesticide_Use_Category', 'Season'], axis=1)
df = pd.concat([df, dummies], axis=1)

df['Crop_Damage'] = df['Crop_Damage'].map(
    {'Minimal Damage': 0, 'Partial Damage': 1, 'Significant Damage': 2})

x = df.drop('Crop_Damage', axis=1).values
y = df.Crop_Damage.values

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.3, random_state=101)

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
x_train = mms.fit_transform(x_train)
x_test = mms.transform(x_test)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten

model = Sequential()
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(6, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(3, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=13)

import numpy as np
pred = np.argmax(model.predict(x_test), axis=-1)

from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

I think it might be because most of the data is in one category, but I'm not sure. Is there anything I can do to solve this?

You don't want to just get rid of this warning: it tells you that class 2 never appears in your predictions, because there were no samples of it in the training set.

You have an imbalanced classification problem, and class 2 has a really low number of samples; it was present in the test data only.

I suggest two things:

StratifiedKFold, so that when you split into training and test sets, the split considers all classes.

Oversampling: you might need to adjust your data by randomly resampling the training dataset to duplicate examples from the minority class.
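A minimal sketch of both ideas with scikit-learn and NumPy (the toy arrays `X` and `y` below are made-up stand-ins for the real features and `Crop_Damage` labels): passing `stratify=y` to `train_test_split` preserves the class proportions in both halves, so the rare class is guaranteed to land in the training set, and random oversampling then duplicates minority-class rows in the training set only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the real features / Crop_Damage labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 830 + [1] * 140 + [2] * 30)

# 1) Stratified split: class proportions are preserved in both halves,
#    so the rare class 2 is guaranteed to appear in the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=101)

# 2) Random oversampling (training set only!): sample each class's row
#    indices with replacement until every class matches the majority count
classes, counts = np.unique(y_tr, return_counts=True)
target = counts.max()
idx = np.concatenate([
    rng.choice(np.flatnonzero(y_tr == c), size=target, replace=True)
    for c in classes])
X_bal, y_bal = X_tr[idx], y_tr[idx]

print(np.bincount(y_tr))   # still imbalanced
print(np.bincount(y_bal))  # every class duplicated up to the majority count
```

For cross-validation the same idea is scikit-learn's `StratifiedKFold`; the imbalanced-learn package also provides `RandomOverSampler`, which does this duplication for you.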

As desternaut said, you have a warning, not an error.

This warning is telling you that the classification_report output is affected because one of the labels is never predicted by your model (in your case, label "2").

This creates a problem when calculating precision (a division by zero), because true positives + false positives = 0. When the function runs into this, it automatically outputs 0. Note that this is not the real value; strictly it should be "undefined" or similar, but that is the function's approach. As you can see, when the macro avg is calculated, this substituted 0 is used, so the warning is just reminding you that your macro avg is influenced by a "fake" 0.

The same happens with the F1-score, since it is calculated from precision.

How to solve it? Well, technically there is nothing to solve, because it is not an error, so you can live with it. But you have to be aware that your output is being affected.

What you can do is decide that you are not interested in the scores of labels that were never predicted, and then explicitly specify the labels you are interested in (i.e. the labels that were predicted at least once):

print(classification_report(y_test, pred, labels=np.unique(pred)))

Note that this solution is not great, because it hides problems you have with your model and data, but it can be useful in some cases.

Moreover, as Yefet said, your model seems to have trouble classifying label "2" because your data is imbalanced. Follow his suggestions and improve your model if you can.

If you just want to get rid of the warning, even knowing that it is hiding problems, you can use the zero_division parameter.

According to the documentation:

zero_division: "warn", 0 or 1, default="warn"

Sets the value to return when there is a zero division. If set to "warn", this acts as 0, but warnings are also raised.

So you can hide the warnings without altering the result of the classification report:

print(classification_report(y_test, pred, zero_division=0))

I had the same problem, and the solution presented above worked:

classification_report(y_test, y_pred, labels=np.unique(y_pred))

But after thoroughly checking the data, I concluded that some columns in my data were too large and needed to be scaled/normalised.

Therefore, adding the following scaling to the code gives much better results (at least in my case):

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
