How to make predictions on a logistic regression model with a separate df for train and test data

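I am working on a logistic regression model. I started with two separate CSV files, one for training data and one for testing data, and created a separate data frame for each. I can fit and train the model just fine, but I get an error when I try to make predictions with the test data.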

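I'm not sure whether I set up my y_train variable correctly or whether something else is wrong. When I run the prediction, I get the error message shown below.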

Here is the setup and code for the model:

#Setting x and y values
X_train = clean_df_train[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_train = clean_df_train['churn']

X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_test = clean_df_test['churn']

# Fitting / training the logistic regression model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

# Make predictions with the logit model
predictions = logreg.predict(X_test)

# Measure performance of the model
from sklearn.metrics import classification_report

classification_report(y_test, predictions)

  1522     """
   1523 
-> 1524     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1525 
   1526     labels_given = True

E:\Users\davidwool\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
     79     if len(y_type) > 1:
     80         raise ValueError("Classification metrics can't handle a mix of {0} "
---> 81                          "and {1} targets".format(type_true, type_pred))
     82 
     83     # We can't have more than one value on y_type => The set is no more needed

ValueError: Classification metrics can't handle a mix of continuous and binary targets

Here is the head of the data I am working with. The churn column is completely blank, because that is what I want to predict.

clean_df_test.head()

    account_length  total_day_charge    total_eve_charge    total_night_charge  number_customer_service_calls   churn
0               74             31.91               13.89                 8.82                               0     NaN
1               57             30.06               16.58                 9.61                               0     NaN
2              111             36.43               17.72                 8.21                               1     NaN
3               77             42.81               17.48                12.38                               2     NaN
4               36             47.84               17.19                 8.42                               2     NaN

Here are the dtypes as well.

clean_df_test.dtypes
account_length                     int64
total_day_charge                 float64
total_eve_charge                 float64
total_night_charge               float64
number_customer_service_calls      int64
churn                            float64
dtype: object

The main issue is that I am used to calling sklearn's train_test_split() function on a single dataset. Since I have 2 separate datasets here, I'm not sure what to set my y_test to.

The problem becomes evident from the output of clean_df_test.head(): the churn column contains null values.

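As a result, y_test contains null values. When you pass it as y_true to classification_report(), the function has to compare nulls with integer predictions. The error message reflects this: the true labels are detected as continuous while the predictions are binary, and that mix raises the ValueError.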

To fix this, try dropping the rows where churn is NaN and then run the rest of the code as before.

# Drop records where `churn` is NaN
clean_df_test.dropna(axis=0, subset=['churn'], inplace=True)

# Carry on as before
X_test = clean_df_test[['account_length','total_day_charge','total_eve_charge', 'total_night_charge', 
            'number_customer_service_calls']]
y_test = clean_df_test['churn']
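
If churn is missing for every row of the test file, as the question describes, dropping the NaN rows leaves nothing to score. One possible workaround, sketched here on synthetic data, is to hold out part of the labeled training data for evaluation and use the unlabeled test file only for prediction. The frames `df_train` and `df_test`, the reduced feature list, the synthetic labeling rule, and the 80/20 split are all assumptions made for illustration; they are not the asker's data or settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = ['account_length', 'number_customer_service_calls']

# Synthetic stand-in for the labeled training file (hypothetical values)
df_train = pd.DataFrame({
    'account_length': rng.integers(1, 200, size=200),
    'number_customer_service_calls': rng.integers(0, 6, size=200),
})
# Hypothetical labeling rule, used only so the example has two classes
df_train['churn'] = (df_train['number_customer_service_calls'] >= 3).astype(int)

# Synthetic stand-in for the unlabeled test file: churn is NaN in every row
df_test = pd.DataFrame({
    'account_length': rng.integers(1, 200, size=50),
    'number_customer_service_calls': rng.integers(0, 6, size=50),
    'churn': np.nan,
})

# Hold out 20% of the labeled rows so there are true labels to score against
X_fit, X_val, y_fit, y_val = train_test_split(
    df_train[features], df_train['churn'], test_size=0.2, random_state=0
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_fit, y_fit)

# Evaluate on the hold-out rows, where the labels are known
report = classification_report(y_val, logreg.predict(X_val))
print(report)

# Predict on the unlabeled test rows; no metrics can be computed for these
test_predictions = logreg.predict(df_test[features])
```

With this layout, classification_report() only ever sees labels that exist, and the predictions for the test file can be written back into its churn column if needed.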

Another way to spot this problem is to look at the dtypes of clean_df_test. In the output, churn has type float, which should not be the case if it were filled entirely with 1s and 0s!
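
The dtype clue can be checked directly. This small sketch uses two hypothetical Series (not the asker's data) to show that pandas stores a 0/1 column as an integer type only when no values are missing; a single NaN forces float64. Counting the nulls with isna().sum() before building y_test confirms the diagnosis.

```python
import numpy as np
import pandas as pd

# Hypothetical label columns, for illustration only
complete_labels = pd.Series([1, 0, 0, 1])
labels_with_gap = pd.Series([1, 0, np.nan, 1])

print(complete_labels.dtype)   # integer dtype: no missing values
print(labels_with_gap.dtype)   # float64: NaN cannot be stored as an integer

# Count the missing labels before passing the column to a metric
n_missing = int(labels_with_gap.isna().sum())
print(n_missing)
```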
