Splitting data into test and train, making a logistic regression model in pandas
How to make predictions on a logistic regression model with a separate df for train and test data
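I am working on a logistic regression model. I started with two separate CSV files, one for training data and one for test data, and created a separate data frame for each. I can fit and train the model without trouble, but I get an error when I try to make predictions with the test data.
I am not sure whether I set up my y_train variable correctly or whether something else is wrong. When I run the prediction, I get the error message shown below.
Here is the setup and code for the model: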
from sklearn.linear_model import LogisticRegression

# Setting X and y values
X_train = clean_df_train[['account_length', 'total_day_charge', 'total_eve_charge', 'total_night_charge',
                          'number_customer_service_calls']]
y_train = clean_df_train['churn']
X_test = clean_df_test[['account_length', 'total_day_charge', 'total_eve_charge', 'total_night_charge',
                        'number_customer_service_calls']]
y_test = clean_df_test['churn']
# Fitting / training the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False)
#Make Predictions with Logit Model
predictions = logreg.predict(X_test)
#Measure Performance of the model
from sklearn.metrics import classification_report
classification_report(y_test, predictions)
1522 """
1523
-> 1524 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
1525
1526 labels_given = True
E:\Users\davidwool\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
79 if len(y_type) > 1:
80 raise ValueError("Classification metrics can't handle a mix of {0} "
---> 81 "and {1} targets".format(type_true, type_pred))
82
83 # We can't have more than one value on y_type => The set is no more needed
ValueError: Classification metrics can't handle a mix of continuous and binary targets
Here is the head of the data I am working with. The churn column is completely blank because that is what I want to predict.
clean_df_test.head()
   account_length  total_day_charge  total_eve_charge  total_night_charge  number_customer_service_calls  churn
0              74             31.91             13.89                8.82                              0    NaN
1              57             30.06             16.58                9.61                              0    NaN
2             111             36.43             17.72                8.21                              1    NaN
3              77             42.81             17.48               12.38                              2    NaN
4              36             47.84             17.19                8.42                              2    NaN
Here are the dtypes as well.
clean_df_test.dtypes
account_length int64
total_day_charge float64
total_eve_charge float64
total_night_charge float64
number_customer_service_calls int64
churn float64
dtype: object
The main problem is that I am used to calling sklearn's train_test_split() function on a single dataset. Since I have 2 separate datasets here, I am not sure what to set my y_test to.
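Looking at clean_df_test.head() makes the problem clear: the churn column contains null values.
As a result, y_test contains null values, and by passing it as y_true to classification_report() you are making the function compare null values with integers, which raises the error. The message itself points the same way: y_true is treated as a continuous target, while the predictions are binary.
To fix this, try dropping the rows where churn is NaN and then run the rest of the code as before.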
# Drop records where `churn` is NaN
clean_df_test.dropna(axis=0, subset=['churn'], inplace=True)
# Carry on as before
X_test = clean_df_test[['account_length', 'total_day_charge', 'total_eve_charge', 'total_night_charge',
                        'number_customer_service_calls']]
y_test = clean_df_test['churn']
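One caveat about dropping rows: if every churn value in the test frame is missing, as the question describes, nothing is left to score after the drop. A possible alternative, sketched below, is to hold out part of the labelled training data for evaluation and use the unlabelled test frame only for prediction. The data here is synthetic, and the names train_df, test_df, features, and churn_pred, along with the labelling rule, are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for clean_df_train; the real data is not shown in the question.
rng = np.random.default_rng(0)
n = 200
train_df = pd.DataFrame({
    "total_day_charge": rng.normal(30, 8, n),
    "number_customer_service_calls": rng.integers(0, 6, n),
})
# Made-up labelling rule, only so the example has two classes to learn.
train_df["churn"] = (train_df["number_customer_service_calls"] >= 3).astype(int)

features = ["total_day_charge", "number_customer_service_calls"]

# Hold out part of the labelled data; this is the only place true labels exist.
X_fit, X_val, y_fit, y_val = train_test_split(
    train_df[features], train_df["churn"],
    test_size=0.25, random_state=0, stratify=train_df["churn"],
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_fit, y_fit)

# Score on the hold-out split, where y_val has no missing values.
report = classification_report(y_val, logreg.predict(X_val))
print(report)

# Unlabelled test rows: predict only, since there is no y_test to compare against.
test_df = pd.DataFrame({
    "total_day_charge": [31.91, 30.06, 36.43],
    "number_customer_service_calls": [0, 0, 1],
})
test_df["churn_pred"] = logreg.predict(test_df[features])
```

With this layout, classification_report() only ever sees labels that actually exist, and the predictions for the unlabelled rows end up in a new column instead of being compared with NaN.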
Another way to spot this problem is to look at the dtypes of clean_df_test. In the output, churn has type float64, which it should not be if the column were filled entirely with 1s and 0s!
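The dtype check can be turned into a quick diagnostic. The sketch below uses two tiny made-up frames (labelled and unlabelled are hypothetical names, not from the question) to show that a fully populated 0/1 column is stored as an integer type, that NaN values force the column to float64, and how to count the missing labels before attempting to score anything.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a labelled and an unlabelled frame.
labelled = pd.DataFrame({"account_length": [74, 57, 111], "churn": [0, 1, 0]})
unlabelled = pd.DataFrame({"account_length": [77, 36], "churn": [np.nan, np.nan]})

# A fully populated 0/1 column is stored with an integer dtype ...
labelled_dtype = labelled["churn"].dtype
# ... while missing values force the column to float64.
unlabelled_dtype = unlabelled["churn"].dtype

# Counting missing labels shows whether the frame can be scored at all.
n_missing = int(unlabelled["churn"].isna().sum())
all_missing = bool(unlabelled["churn"].isna().all())

print(labelled_dtype, unlabelled_dtype, n_missing, all_missing)
```

If all_missing is True, the frame has no ground truth, so it can be used for prediction but not for classification_report().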