Why does my logistic regression yield only one class?

I attempted my first machine learning project using a fictional dataset from Kaggle consisting of 1470 records. 84% of the records were in the '0' class and 16% were '1's. I used 1200 records to train and test, and saved 270 to feed in as new data to see what would happen. I ended up with a training score of 87% and a test score of 83%, but all 270 records of new data were classified as 0.

Could it be that the data, being fictional, just doesn't make enough of a pattern to teach the machine how to classify? Or am I doing something wrong?

I've read some of the other posts that seem to concern a similar problem, but I haven't found a relevant response. Any help would be appreciated.

import pandas as pd

df = pd.read_csv('Resources/train_data.csv')

# Drop identifier, constant, and otherwise unused columns, then remove duplicate rows
df_skinny = df.drop(['EducationField', 'EmployeeCount', 'EmployeeNumber', 'index',
                     'StandardHours', 'JobRole', 'MaritalStatus', 'DailyRate',
                     'MonthlyRate', 'HourlyRate', 'Over18', 'OverTime'],
                    axis=1).drop_duplicates()
df_skinny.rename(columns={"Attrition": "EmploymentStatus"}, inplace=True)

# Encode the categorical columns as integers
df_skinny['EmploymentStatus'] = df_skinny['EmploymentStatus'].replace(['Yes', 'No'], [1, 0])
df_skinny['Gender'] = df_skinny['Gender'].replace(['Female', 'Male'], [0, 1])
df_skinny['BusinessTravel'] = df_skinny['BusinessTravel'].replace(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'], [1, 2, 0])
df_skinny['Department'] = df_skinny['Department'].replace(['Human Resources', 'Sales', 'R&D'], [0, 1, 2])

df_train = df_skinny[:1200]
df_new = df_skinny[1201:]  # note: positional slicing from 1201 skips the row at position 1200

X = df_train.drop("EmploymentStatus", axis=1)
y = df_train["EmploymentStatus"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

classifier.fit(X_train_scaled, y_train)

print(f"Training Data Score: {classifier.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test_scaled, y_test)}")

predictions = classifier.predict(X_test_scaled)
print(f"First 30 Predictions:   {predictions[:30]}")
print(f"First 30 Actual Employment Status: {y_test[:30].tolist()}")

new_X = df_new.drop("EmploymentStatus", axis=1)
new_predictions=classifier.predict(new_X)
print(new_predictions)

ynew = classifier.predict_proba(new_X)
print(ynew)

OUTPUT:
Training Data Score: 0.8655555555555555
Testing Data Score: 0.8333333333333334

First 30 Predictions:   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]

First 30 Actual Employment Status: [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0] 

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]

[[1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 5.24119991e-298]
 [1.00000000e+000 7.88999798e-158]
 [1.00000000e+000 2.73485216e-286]
 [1.00000000e+000 0.00000000e+000]
 [1.00000000e+000 0.00000000e+000]

As you mentioned, 84% of the data is in class 0 and 16% is in class 1. This is highly imbalanced data, and the model will be strongly biased toward the majority class. That's why you are getting mostly 0s as results.
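You can confirm the imbalance with a quick check; a minimal sketch using pandas value_counts on the df_train variable from your code:

# Show the class proportions in the training data (uses df_train from the question)
print(df_train['EmploymentStatus'].value_counts(normalize=True))
# Should print roughly 0.84 for class 0 and 0.16 for class 1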

A good dataset has roughly balanced data across all the classes, so you need to balance yours using random sampling techniques. There are two kinds of sampling: oversampling and undersampling.

I recommend applying a sampling technique to balance your data first, for example as sketched below.
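Here is a minimal sketch of both techniques using the imbalanced-learn package (an assumption on my part; it is also what the article below uses), reusing the X_train_scaled and y_train variables from your code:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Oversampling: randomly duplicate minority-class (1) rows until the classes are even
ros = RandomOverSampler(random_state=1)
X_train_bal, y_train_bal = ros.fit_resample(X_train_scaled, y_train)
print(Counter(y_train_bal))  # both classes now have the same count

# Undersampling: randomly drop majority-class (0) rows instead
rus = RandomUnderSampler(random_state=1)
X_train_small, y_train_small = rus.fit_resample(X_train_scaled, y_train)

# Refit the model on the balanced training set
classifier = LogisticRegression()
classifier.fit(X_train_bal, y_train_bal)

Undersampling discards most of the majority class, which hurts on a dataset this small, so random oversampling is usually the safer first try.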

You can learn more about it in the article below:
https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

You can also refer to this notebook:
https://www.kaggle.com/shweta2407/oversampling-vs-undersampling-techniques
