
How to use logistic regression on test data

I am using logistic regression on my Titanic model, and when I run it in PyCharm I am told to pass DataFrames with boolean values only:

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
    predictions = logReg.predict(test[test_data])
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 2914, in __getitem__
    return self._getitem_frame(key)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 3009, in _getitem_frame
    raise ValueError('Must pass DataFrame with boolean values only')
ValueError: Must pass DataFrame with boolean values only

I don't understand why, because the exact same features were passed to the logistic regression while training the model and they were accepted then. Here is my code (ignore the code repetition; that's a problem I'm going to tackle afterwards):

import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore", category=FutureWarning)

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")

train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)
train['HasCabin'] = train['Cabin'].notnull().astype(int)
train['Relatives'] = train['SibSp'] + train['Parch']
train_data = train[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
x_train, x_validate, y_train, y_validate = train_test_split(train_data, train['Survived'], test_size=0.22, random_state=0)

test['Sex'] = test['Sex'].replace(['female', 'male'], [0, 1])
test['Embarked'] = test['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
test['Age'].fillna(test.groupby('Sex')['Age'].transform("median"), inplace=True)
test['HasCabin'] = test['Cabin'].notnull().astype(int)
test['Relatives'] = test['SibSp'] + test['Parch']
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train)

predictions = logReg.predict(test[test_data])
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})

filename = 'Titanic-Submission.csv'
submission.to_csv(filename, index=False)

Specifically, Python takes issue with this snippet:

test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]

...

predictions = logReg.predict(test[test_data])

UPDATE

I've changed my predictions variable to this:

predictions = logReg.predict(test_data)

And now this is my stack trace:

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
    predictions = logReg.predict(test_data)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 281, in predict
    scores = self.decision_function(X)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 257, in decision_function
    X = check_array(X, accept_sparse='csr')
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 573, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This suggests that my feature selection/engineering for the test data is not working as intended.
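For context on the first traceback: test_data is already a DataFrame of the selected columns, so test[test_data] indexes one DataFrame with another, and pandas treats a DataFrame key as a boolean mask. The following is a minimal sketch on a made-up two-row frame (the values are illustrative, not taken from the Kaggle files); the exact error message may differ between pandas versions.

```python
import pandas as pd

# Toy frame standing in for the Kaggle test set (values are made up).
test = pd.DataFrame({"Pclass": [3, 1], "Fare": [7.25, 71.28]})

# Selecting with a list of column names already returns a DataFrame.
test_data = test[["Pclass", "Fare"]]

# Using that DataFrame as a key makes pandas expect a boolean mask,
# so a frame of numbers is rejected with a ValueError.
try:
    test[test_data]
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Passing the selected frame directly is what a model's predict() needs.
print(test_data.shape)
```

Passing test_data itself to predict, as in the update above, avoids the mask lookup entirely.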

You have a NaN value in the Fare column that you don't handle. Filling it the same way you fill Age takes care of the problem. Is this the best solution for the model? That's a separate discussion, but it gets rid of the error.

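# Fill missing Fare values with the median Fare of passengers of the same sex,
# computed separately within each frame (mirrors how Age is filled above)
train['Fare'].fillna(train.groupby('Sex')['Fare'].transform("median"), inplace=True)
test['Fare'].fillna(test.groupby('Sex')['Fare'].transform("median"), inplace=True)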
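A common variant is to compute the fill value from the training data only and reuse it for the test data, so the test set does not influence preprocessing. Here is a minimal sketch on toy frames; the values are made up, and using a single overall median (rather than a per-group one) is an assumption made for brevity.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Kaggle frames (values are made up for illustration).
train = pd.DataFrame({"Pclass": [1, 3, 3, 1], "Fare": [80.0, 8.0, 7.0, 60.0]})
test = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [np.nan, 75.0, 9.0]})

# Learn the fill value from the training data only.
fare_median = train["Fare"].median()

# Assign the filled column back instead of relying on inplace=True.
test["Fare"] = test["Fare"].fillna(fare_median)

print(test)
```

Assigning the result back to the column also avoids the chained in-place modification that newer pandas versions warn about.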

Predictions with x_validate work without a problem. Try:

>>> predictions = logReg.predict(x_validate)

So there must be something wrong with test_data. Get some information on the two DataFrames and compare:

>>> x_validate.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 197 entries, 495 to 45
Data columns (total 7 columns):
Pclass       197 non-null int64
Sex          197 non-null int64
Relatives    197 non-null int64
Fare         197 non-null float64
Age          197 non-null float64
Embarked     197 non-null int64
HasCabin     197 non-null int64
dtypes: float64(2), int64(5)
memory usage: 12.3 KB

>>> test_data.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass       418 non-null int64
Sex          418 non-null int64
Relatives    418 non-null int64
Fare         417 non-null float64
Age          418 non-null float64
Embarked     418 non-null int64
HasCabin     418 non-null int64
dtypes: float64(2), int64(5)
memory usage: 22.9 KB

Looks like there's a NaN here:

Fare         417 non-null float64
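Instead of reading the non-null counts by eye, the missing values can also be counted directly. This is a small sketch on a made-up frame (test_data here is a toy stand-in, not the real Kaggle data) that lists the columns containing NaN and the rows affected.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_data with one missing Fare (values are made up).
test_data = pd.DataFrame({"Fare": [7.8, np.nan, 9.7], "Age": [34.5, 47.0, 62.0]})

# Number of missing values per column.
nan_counts = test_data.isna().sum()
print(nan_counts[nan_counts > 0])

# Rows that contain at least one missing value.
rows_with_nan = test_data[test_data.isna().any(axis=1)]
print(rows_with_nan)
```

Running the same check on the real test_data before calling predict should show whether any column still needs to be filled.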
