How to use logistic regression on test data
I am using Logistic Regression on my Titanic model, and when I run it in PyCharm I am told to pass a DataFrame with boolean values only:
Traceback (most recent call last):
File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
predictions = logReg.predict(test[test_data])
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 2914, in __getitem__
return self._getitem_frame(key)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 3009, in _getitem_frame
raise ValueError('Must pass DataFrame with boolean values only')
ValueError: Must pass DataFrame with boolean values only
I don't understand why, because the exact same features were used to train the Logistic Regression model and they were accepted without complaint. Here is my code (ignore the repetition; that's a problem I'm going to tackle afterwards):
import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore", category=FutureWarning)
train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")
train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)
train['HasCabin'] = train['Cabin'].notnull().astype(int)
train['Relatives'] = train['SibSp'] + train['Parch']
train_data = train[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
x_train, x_validate, y_train, y_validate = train_test_split(train_data, train['Survived'], test_size=0.22, random_state=0)
test['Sex'] = test['Sex'].replace(['female', 'male'], [0, 1])
test['Embarked'] = test['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
test['Age'].fillna(test.groupby('Sex')['Age'].transform("median"), inplace=True)
test['HasCabin'] = test['Cabin'].notnull().astype(int)
test['Relatives'] = test['SibSp'] + test['Parch']
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
logReg = LogisticRegression()
logReg.fit(x_train, y_train)
predictions = logReg.predict(test[test_data])
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})
filename = 'Titanic-Submission.csv'
submission.to_csv(filename, index=False)
Specifically, Python takes issue with this snippet:
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
...
predictions = logReg.predict(test[test_data])
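For reference, the first error comes from how pandas reads the index expression: test_data is already a DataFrame, and a DataFrame placed inside test[...] is treated as a boolean mask rather than a column selection. A minimal sketch on made-up data (the toy values and the mask_error name are illustrative only, and the exact exception type and message depend on the pandas version):

```python
import pandas as pd

# Toy stand-in for the Titanic test set (values are made up).
test = pd.DataFrame({"Pclass": [3, 1], "Fare": [7.25, 71.28], "Name": ["a", "b"]})

# Selecting columns with a list of names already returns a DataFrame.
test_data = test[["Pclass", "Fare"]]

# Indexing with that DataFrame is interpreted as a boolean mask, so pandas
# rejects it because the values are numbers, not booleans.
try:
    test[test_data]
    mask_error = None
except (ValueError, TypeError) as exc:
    mask_error = exc

print(mask_error)
print(list(test_data.columns))
```

Passing test_data itself to predict avoids the second indexing step entirely.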
UPDATE
I've changed my predictions variable to this:
predictions = logReg.predict(test_data)
And now this is my stack trace:
Traceback (most recent call last):
File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
predictions = logReg.predict(test_data)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 281, in predict
scores = self.decision_function(X)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 257, in decision_function
X = check_array(X, accept_sparse='csr')
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 573, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Which means that my feature selection/engineering for the test data does not go through.
You have a NaN value in the Fare column of the test data that you don't handle. Filling it the same way you fill Age takes care of the problem. Is this the best solution for the model? That's a different argument, but it gets rid of the error.
# Fill missing fares with the median fare of passengers of the same Sex.
train['Fare'] = train['Fare'].fillna(train.groupby('Sex')['Fare'].transform("median"))
test['Fare'] = test['Fare'].fillna(test.groupby('Sex')['Fare'].transform("median"))
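One possible variation, sketched here on made-up data, is to compute the fill value from the training set and reuse it on the test set, so the test rows are handled with statistics the model's training data already reflects. The toy frames and the fare_by_sex name are assumptions for illustration, not part of the original code:

```python
import numpy as np
import pandas as pd

# Made-up miniature versions of the train and test frames.
train = pd.DataFrame({"Sex": [0, 1, 0, 1], "Fare": [10.0, 20.0, 30.0, 40.0]})
test = pd.DataFrame({"Sex": [0, 1, 1], "Fare": [15.0, np.nan, 8.0]})

# Median fare per Sex, learned from the training rows only.
fare_by_sex = train.groupby("Sex")["Fare"].median()

# Look up each test row's Sex in that table and use the result
# only where Fare is missing.
test["Fare"] = test["Fare"].fillna(test["Sex"].map(fare_by_sex))

print(test["Fare"].tolist())  # [15.0, 30.0, 8.0]
```

Whether a per-group median is the right imputation for the model is a separate question; the sketch only shows the mechanics.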
Predictions with x_validate work without a problem. Try:
>>> predictions = logReg.predict(x_validate)
So there must be something wrong with test_data. Get some information on the DataFrames and compare:
>>> x_validate.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 197 entries, 495 to 45
Data columns (total 7 columns):
Pclass 197 non-null int64
Sex 197 non-null int64
Relatives 197 non-null int64
Fare 197 non-null float64
Age 197 non-null float64
Embarked 197 non-null int64
HasCabin 197 non-null int64
dtypes: float64(2), int64(5)
memory usage: 12.3 KB
>>> test_data.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass 418 non-null int64
Sex 418 non-null int64
Relatives 418 non-null int64
Fare 417 non-null float64
Age 418 non-null float64
Embarked 418 non-null int64
HasCabin 418 non-null int64
dtypes: float64(2), int64(5)
memory usage: 22.9 KB
Looks like there's a NaN here:
Fare 417 non-null float64
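The same check can be done programmatically instead of reading the info() output by eye. A minimal sketch, assuming a small made-up frame in place of the real test_data (the missing_per_column and rows_with_nan names are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for test_data with one missing Fare.
test_data = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Fare": [7.25, np.nan, 13.0],
    "Age": [22.0, 38.0, 26.0],
})

# Number of missing values per column; keep only the columns that have any.
missing_per_column = test_data.isna().sum()
print(missing_per_column[missing_per_column > 0])

# The rows that contain at least one missing value.
rows_with_nan = test_data[test_data.isna().any(axis=1)]
print(rows_with_nan)
```

Running the equivalent two checks on the real test_data before calling predict should point straight at the Fare column.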