How to make predictions on a logistic regression model with a separate df for train and test data

I am working on a logistic regression model. I started out with two separate CSV files, one for training data and one for testing data. I created two separate data frames, one for each data set. I am able to fit and train the model just fine, but I get an error when I try to make predictions using the test data.

I am not sure if I am setting my y_train variable properly or if there is another issue going on. I get the following error message when I run the prediction.

Here is the setup and code for the model:

#Setting x and y values
X_train = clean_df_train[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_train = clean_df_train['churn']

X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_test = clean_df_test['churn']

#Fitting / Training the Logistic Regression Model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Output of fit():
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#           intercept_scaling=1, max_iter=100, multi_class='warn',
#           n_jobs=None, penalty='l2', random_state=None, solver='warn',
#           tol=0.0001, verbose=0, warm_start=False)

#Make Predictions with Logit Model
predictions = logreg.predict(X_test)

#Measure performance of the model
from sklearn.metrics import classification_report

classification_report(y_test, predictions)
  1522     """
   1523 
-> 1524     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1525 
   1526     labels_given = True

E:\Users\davidwool\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
     79     if len(y_type) > 1:
     80         raise ValueError("Classification metrics can't handle a mix of {0} "
---> 81                          "and {1} targets".format(type_true, type_pred))
     82 
     83     # We can't have more than one value on y_type => The set is no more needed

ValueError: Classification metrics can't handle a mix of continuous and binary targets

Here is the head of the data that I am working with. The churn column is completely blank, as it is what I am trying to predict.

clean_df_test.head()

    account_length  total_day_charge    total_eve_charge    total_night_charge  number_customer_service_calls   churn
0               74             31.91               13.89                 8.82                               0     NaN
1               57             30.06               16.58                 9.61                               0     NaN
2              111             36.43               17.72                 8.21                               1     NaN
3               77             42.81               17.48                12.38                               2     NaN
4               36             47.84               17.19                 8.42                               2     NaN

Here are the dtypes as well.

clean_df_test.dtypes
account_length                     int64
total_day_charge                 float64
total_eve_charge                 float64
total_night_charge               float64
number_customer_service_calls      int64
churn                            float64
dtype: object

The main problem is that I am used to calling sklearn's train_test_split() function on a single dataset, whereas here I have two separate datasets, so I am not sure what to set y_test to.

The problem becomes evident from the output of clean_df_test.head(): the churn column contains null values.

As a consequence, y_test contains null values. When you pass it as y_true to classification_report(), the function has to compare NaN labels against the integer class predictions. In the traceback above, sklearn treats the NaN-filled y_test as a continuous target and the predictions as a binary one, and it raises the ValueError because it cannot mix the two.
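A quick way to confirm this before calling any metric is to count the missing labels. The sketch below rebuilds only the five rows shown in clean_df_test.head() in the question, so the counts refer to that sample rather than to the full test file.

```python
import numpy as np
import pandas as pd

# First five test rows, copied from clean_df_test.head() in the question.
clean_df_test = pd.DataFrame({
    'account_length': [74, 57, 111, 77, 36],
    'total_day_charge': [31.91, 30.06, 36.43, 42.81, 47.84],
    'total_eve_charge': [13.89, 16.58, 17.72, 17.48, 17.19],
    'total_night_charge': [8.82, 9.61, 8.21, 12.38, 8.42],
    'number_customer_service_calls': [0, 0, 1, 2, 2],
    'churn': [np.nan] * 5,
})

y_test = clean_df_test['churn']

# Count how many labels are missing before handing y_test to a metric.
n_missing = int(y_test.isna().sum())
print(f"{n_missing} of {len(y_test)} test labels are missing")
print("all labels missing:", bool(y_test.isna().all()))
```

If every label is missing, there is nothing for classification_report() to compare the predictions against.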

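To solve this, try dropping the rows where churn is NaN and run the rest of your code as before. Note that this only helps if some test rows carry a label; if churn is blank in every row, as the question describes, no rows remain to score against.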

# Drop records where `churn` is NaN
clean_df_test.dropna(axis=0, subset=['churn'], inplace=True)

# Carry on as before
X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_test = clean_df_test['churn']
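When the test file carries no labels at all, one common alternative is to hold out part of the labeled training data for scoring and use the test file only for predictions. The following is a minimal sketch of that idea, not the asker's actual pipeline: the two data frames are synthetic stand-ins with made-up values, the churn rule is invented purely so the example has labels, and max_iter=1000 is an arbitrary choice to give the solver room to converge.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

features = ['account_length', 'total_day_charge', 'total_eve_charge',
            'total_night_charge', 'number_customer_service_calls']

rng = np.random.default_rng(0)

def make_frame(n_rows):
    """Build a synthetic frame with the question's feature columns (made-up values)."""
    return pd.DataFrame({
        'account_length': rng.integers(1, 200, n_rows),
        'total_day_charge': rng.uniform(0, 60, n_rows),
        'total_eve_charge': rng.uniform(0, 30, n_rows),
        'total_night_charge': rng.uniform(0, 20, n_rows),
        'number_customer_service_calls': rng.integers(0, 6, n_rows),
    })

# Hypothetical stand-ins for clean_df_train and clean_df_test.
clean_df_train = make_frame(200)
clean_df_train['churn'] = (
    clean_df_train['number_customer_service_calls'] > 3).astype(int)  # invented rule
clean_df_test = make_frame(50)
clean_df_test['churn'] = np.nan  # unlabeled, as in the question

# Hold out a quarter of the labeled rows to measure performance.
X_fit, X_val, y_fit, y_val = train_test_split(
    clean_df_train[features], clean_df_train['churn'],
    test_size=0.25, random_state=0, stratify=clean_df_train['churn'])

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_fit, y_fit)

# Score on rows whose true labels are known.
print(classification_report(y_val, logreg.predict(X_val)))

# Predict churn for the unlabeled test rows; these cannot be scored.
test_predictions = logreg.predict(clean_df_test[features])
```

The predictions for the test file can then be written back into its churn column or exported, while the report on the held-out rows gives an estimate of how well the model performs.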

Another way of spotting this issue is to look at the data types of clean_df_test. In the output, churn's dtype is float64, which should not be the case if the column were filled exclusively with ones and zeros. An integer column is upcast to float64 as soon as it holds a NaN.
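The dtype hint can be reproduced with two small Series. This is an illustrative sketch with made-up label values, assuming pandas' default NumPy-backed dtypes rather than the nullable Int64 extension type.

```python
import numpy as np
import pandas as pd

# Made-up labels: one complete column, one with a missing value.
labels_complete = pd.Series([0, 1, 0, 1])
labels_with_gaps = pd.Series([0, 1, np.nan, 1])

# The complete column keeps an integer dtype; the one with NaN becomes float.
print(labels_complete.dtype)
print(labels_with_gaps.dtype)
```

A float dtype on a column that is supposed to hold only 0 and 1 is therefore a cue to check it for missing values.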
