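A concatenated dataframe in python when exported to csv shows blank rows
While implementing the following logistic regression for the Titanic dataset, the rows with missing values are dropped. But when the test data is concatenated with the predictions, those dropped rows still appear as blank rows. Why does this happen?
For the original data, see https://www.kaggle.com/c/titanic/data .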
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
dir_path = os.path.dirname(os.path.realpath(__file__))
os.chdir(dir_path)
train = pd.read_csv('titanic_train.csv')
sns.set_style('whitegrid')
#Data cleaning
train.drop('Cabin',axis=1,inplace=True)
train.dropna(inplace=True)
#Categorical data to dummy vars
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['Sex','Embarked','Name','Ticket','PassengerId'],axis=1,inplace=True)
train = pd.concat([train,sex,embark],axis=1)
print(train.head())
#Develop model
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(train.drop('Survived',axis=1),train['Survived'],test_size=0.3,random_state=101)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions=logmodel.predict(X_test)
predser=pd.Series(predictions)
trainnew=pd.concat([X_test,y_test,predser],axis=1)
trainnew.to_csv('trainresults.csv',index=False)
Sample of the cleaned dataset:
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22.0 1 0 7.2500 1 0 1
1 1 1 38.0 1 0 71.2833 0 0 0
2 1 3 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 0 3 35.0 0 0 8.0500 1 0 1
5 0 3 24.0 0 0 8.4583 1 1 0
6 0 1 54.0 0 0 51.8625 1 0 1
7 0 3 2.0 3 1 21.0750 1 0 1
8 1 3 27.0 0 2 11.1333 0 0 1
9 1 2 14.0 1 0 30.0708 0 0 0
The output CSV looks like this:
Pclass Age SibSp Parch Fare male Q S Survived SPred
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0
6 1.0 54.00 0.0 0.0 51.8625 1.0 0.0 1.0 0.0 0.0
7 3.0 2.00 3.0 1.0 21.0750 1.0 0.0 1.0 0.0 0.0
8 3.0 27.00 0.0 2.0 11.1333 0.0 0.0 1.0 1.0 0.0
9 2.0 14.00 1.0 0.0 30.0708 0.0 0.0 0.0 1.0 1.0
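The problem is that the index of predser does not match the index of X_test and y_test. predser gets a fresh default index (0, 1, 2, ...), while X_test and y_test keep their original row labels from train. pd.concat with axis=1 aligns rows by index label, so any label that is missing from one of the inputs is filled with NaN. One way to fix it is to set the index before the concat, e.g.: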
predser.index = y_test.index  # <== new line
trainnew=pd.concat([X_test,y_test,predser],axis=1)
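To see the alignment behaviour in isolation, here is a minimal sketch on toy data. The frames, labels, and prediction values below are made up for illustration and only stand in for the X_test, y_test, and predictions from the question; the column name SPred is borrowed from the CSV output shown above. It also shows an equivalent variant of the fix: passing the index when building the Series rather than assigning it afterwards.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_test / y_test: non-contiguous row labels, like the
# ones left behind by dropna() followed by train_test_split().
X_test = pd.DataFrame({"Pclass": [1, 3, 2]}, index=[6, 40, 9])
y_test = pd.Series([0, 1, 1], index=[6, 40, 9], name="Survived")
predictions = np.array([0, 1, 0])  # hypothetical model output

# A Series built without an index gets labels 0, 1, 2. concat(axis=1)
# aligns on labels and keeps their union, so no row lines up here.
misaligned = pd.concat(
    [X_test, y_test, pd.Series(predictions, name="SPred")], axis=1
)

# Reusing the test index puts every prediction on its own row.
predser = pd.Series(predictions, index=X_test.index, name="SPred")
aligned = pd.concat([X_test, y_test, predser], axis=1)

print(misaligned)  # six rows, each with NaN in at least one column
print(aligned)     # three rows, no NaN
```

If the original row labels are not needed in the CSV, calling reset_index(drop=True) on X_test and y_test before the concat should work as well, since all three objects then share the default 0..n-1 index.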