Data set link: https://www.kaggle.com/blastchar/telco-customer-churn
What is the cause of the difference in the data type of column "TotalCharges" read by R and pandas? The column in pandas is expected to be a numeric type, not object.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ch_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
ch_data.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID 7043 non-null object
gender 7043 non-null object
SeniorCitizen 7043 non-null int64
Partner 7043 non-null object
Dependents 7043 non-null object
tenure 7043 non-null int64
PhoneService 7043 non-null object
MultipleLines 7043 non-null object
InternetService 7043 non-null object
OnlineSecurity 7043 non-null object
OnlineBackup 7043 non-null object
DeviceProtection 7043 non-null object
TechSupport 7043 non-null object
StreamingTV 7043 non-null object
StreamingMovies 7043 non-null object
Contract 7043 non-null object
PaperlessBilling 7043 non-null object
PaymentMethod 7043 non-null object
MonthlyCharges 7043 non-null float64
TotalCharges 7043 non-null object
Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
The data type of 'TotalCharges' is object.
gg<-read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
str(gg)
Result:
'data.frame': 7043 obs. of 21 variables:
$ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
$ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
$ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
$ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
$ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
$ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
$ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
$ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
$ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
$ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
$ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
$ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
$ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
$ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
$ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
$ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
The data type for TotalCharges is num.
This is caused by different policies in handling space characters. You can use regex separator in pd.read_csv(sep=) to "eat up" columns comprise of solely spaces:
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=r"\,\s*", engine='python')
df.dtypes
Out[19]:
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges float64 <- correct
Churn object
# space -> nan
df["TotalCharges"][488]
Out[23]: nan
You can see that TotalCharges
is read correctly.
NB The rows containing space characters was found this way:
df = pd.read_csv("/mnt/ramdisk/WA_Fn-UseC_-Telco-Customer-Churn.csv")
for i in range(len(df)):
try:
_ = float(df["TotalCharges"][i])
except ValueError:
print(f'float() error: row={i}, val="{df.TotalCharges[i]}"')
# result
float() error: row=488, val=" "
float() error: row=753, val=" "
float() error: row=936, val=" "
float() error: row=1082, val=" "
float() error: row=1340, val=" "
float() error: row=3331, val=" "
float() error: row=3826, val=" "
float() error: row=4380, val=" "
float() error: row=5218, val=" "
float() error: row=6670, val=" "
float() error: row=6754, val=" "
At the same time, R also decides to encode text as categorical variables internally, while pandas does not. Generally speaking, R tries to be a little "smarter" for the potential convenience of data analysts, as R is designed for statistical/analytical purposes. This may or may not bring you trouble. In the contrary, Pandas is more general-purpose, so it does less assumptions for the sake of consistency . So it is just a different choice of the philosophy for function design, and any such perspective would always be purely opinion-based, even if answered by the creators of the functions themselves .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.