How to make predictions on a logistic regression model with a separate df for train and test data

Question

I am working on a logistic regression model. I started out with two separate CSV files, one for training data and one for testing data. I created two separate data frames, one for each data set. I am able to fit and train the model just fine but am getting an error when I try to make predictions using the test data.

I am not sure if I am setting my y_train variable properly or if there is another issue going on. The error message I get is shown below, after the code.

Here is the setup and code for the model:

#Setting x and y values
X_train = clean_df_train[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_train = clean_df_train['churn']

X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_test = clean_df_test['churn']

#Fitting / Training the Logistic Regression Model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

#Make Predictions with Logit Model
predictions = logreg.predict(X_test)

#Measure Performance of the model
from sklearn.metrics import classification_report

classification_report(y_test, predictions)

  1522     """
   1523 
-> 1524     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1525 
   1526     labels_given = True

E:\Users\davidwool\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
     79     if len(y_type) > 1:
     80         raise ValueError("Classification metrics can't handle a mix of {0} "
---> 81                          "and {1} targets".format(type_true, type_pred))
     82 
     83     # We can't have more than one value on y_type => The set is no more needed

ValueError: Classification metrics can't handle a mix of continuous and binary targets

Here is the head of the data that I am working with. The churn column is completely blank as it is what I am trying to predict.

clean_df_test.head()

    account_length  total_day_charge    total_eve_charge    total_night_charge  number_customer_service_calls   churn
0               74             31.91               13.89                 8.82                               0     NaN
1               57             30.06               16.58                 9.61                               0     NaN
2              111             36.43               17.72                 8.21                               1     NaN
3               77             42.81               17.48                12.38                               2     NaN
4               36             47.84               17.19                 8.42                               2     NaN

Here are the dtypes as well.

clean_df_test.dtypes
account_length                     int64
total_day_charge                 float64
total_eve_charge                 float64
total_night_charge               float64
number_customer_service_calls      int64
churn                            float64
dtype: object

The main problem is that I am used to calling sklearn's train_test_split() function on a single dataset, whereas here I have two separate datasets, so I am not sure what to set y_test to.

Answer 1

The problem becomes evident when you look at clean_df_test.head(): the churn column contains null values.

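As a consequence, y_test contains null values. When you pass it as y_true to classification_report(), scikit-learn no longer sees a column of 0/1 class labels: the NaN values make it treat y_test as a continuous target, while the predictions are binary. That mismatch is exactly what the ValueError reports.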
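
A quick way to confirm this before calling any metric is to count the missing labels. The sketch below rebuilds a stand-in for the five rows shown in the question; the frame name demo_test and the variable names are only illustrative, and two of the feature columns are omitted for brevity.

```python
import numpy as np
import pandas as pd

# Stand-in for the first five rows of clean_df_test shown in the question
# (illustrative only; the real frame has more rows and columns).
demo_test = pd.DataFrame({
    "account_length": [74, 57, 111, 77, 36],
    "number_customer_service_calls": [0, 0, 1, 2, 2],
    "churn": [np.nan] * 5,
})

y_demo = demo_test["churn"]

# Count the missing labels and check whether any usable label exists at all.
n_missing = int(y_demo.isna().sum())
has_labels = bool(y_demo.notna().any())

print(f"missing labels: {n_missing} of {len(y_demo)}")
print(f"any usable labels: {has_labels}")
```

If the count of missing labels equals the number of rows, there is nothing in that frame to score predictions against.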

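To solve this, drop the rows where churn is NaN before building X_test and y_test, then run the rest of your code as before. Note that this leaves rows to score only if at least part of the test set has a recorded churn value.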

# Drop records where `churn` is NaN
clean_df_test.dropna(axis=0, subset=['churn'], inplace=True)

# Carry on as before
X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_test = clean_df_test['churn']
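
If the test file has no churn values at all, as the question suggests, there are no true labels to compare against, so predictions on it can be generated but not scored. One common option in that situation is to hold out part of the labeled training data for evaluation. The following is a minimal sketch under that assumption; the data is synthetic, and make_frame, demo_train, demo_test and the toy churn rule are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

features = ["account_length", "total_day_charge", "number_customer_service_calls"]
rng = np.random.default_rng(0)


def make_frame(n_rows, labeled):
    """Build a synthetic frame (assumption: stands in for the real CSV data)."""
    frame = pd.DataFrame({
        "account_length": rng.integers(1, 200, n_rows),
        "total_day_charge": rng.uniform(0, 60, n_rows),
        "number_customer_service_calls": rng.integers(0, 6, n_rows),
    })
    if labeled:
        # Toy labeling rule so that both classes are present.
        frame["churn"] = (frame["number_customer_service_calls"] >= 3).astype(int)
    else:
        # Mimic a test file whose target column is entirely blank.
        frame["churn"] = np.nan
    return frame


demo_train = make_frame(200, labeled=True)
demo_test = make_frame(50, labeled=False)

# Hold out a quarter of the labeled rows for evaluation.
X_fit, X_val, y_fit, y_val = train_test_split(
    demo_train[features],
    demo_train["churn"],
    test_size=0.25,
    random_state=0,
    stratify=demo_train["churn"],
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_fit, y_fit)

# Labels exist for the validation rows, so metrics can be computed here.
report = classification_report(y_val, logreg.predict(X_val))
print(report)

# The unlabeled test rows can be predicted, but not scored.
test_predictions = logreg.predict(demo_test[features])
```

The validation report then plays the role that y_test would have played, and test_predictions holds the churn predictions for the unlabeled file.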

Another way of spotting this issue is to look at the data types of clean_df_test. The output shows churn as float64. Had the column contained only ones and zeros, pandas would normally have read it as an integer type; missing values force it to float.
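
The dtype behavior can be reproduced with two tiny in-memory CSVs. This is only a sketch: the id column and the CSV contents are made up for the demonstration and are not part of the original data.

```python
import io

import pandas as pd

# Every row has a churn value, so pandas infers an integer column.
complete = pd.read_csv(io.StringIO("id,churn\n1,0\n2,1\n3,1\n"))

# One churn value is blank, so pandas stores NaN and falls back to float.
with_gaps = pd.read_csv(io.StringIO("id,churn\n1,0\n2,\n3,1\n"))

print(complete["churn"].dtype)
print(with_gaps["churn"].dtype)
```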
