
A csv column was read as "num" in R but "object" in pandas.read_csv()

Data set link: https://www.kaggle.com/blastchar/telco-customer-churn

Why do R and pandas read the column "TotalCharges" as different data types? I expected pandas to read it as a numeric type, not object.

Python pandas.read_csv()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ch_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
ch_data.info()

Result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

The data type of 'TotalCharges' is object.

R read.csv()

gg<-read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
str(gg)

Result:

'data.frame':   7043 obs. of  21 variables:
 $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
 $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
 $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
 $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
 $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
 $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
 $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
 $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
 $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
 $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
 $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
 $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
 $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
 $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
 $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
 $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
 $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

The data type for TotalCharges is num.

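This is caused by different policies for handling space characters. A few rows contain a single space in "TotalCharges": R strips the whitespace in a numeric field and treats the blank as NA, while pandas keeps the " " string, which makes the whole column object. You can use a regex separator in pd.read_csv(sep=) to "eat up" fields that consist solely of spaces: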

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=r"\,\s*", engine='python')
df.dtypes
Out[19]: 
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64  <- correct
Churn                object

# space -> nan
df["TotalCharges"][488]
Out[23]: nan

You can see that TotalCharges is read correctly.
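
As an alternative to the regex separator, the blank fields can be declared as missing values at read time, or coerced afterwards. The following is only a minimal sketch: the three-row CSV is a made-up stand-in for the real file, and the variable names are illustrative.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the Telco CSV; the middle row has a
# single space in TotalCharges, like row 488 in the real file.
sample_csv = (
    "customerID,MonthlyCharges,TotalCharges\n"
    "A-1,29.85,29.85\n"
    "B-2,52.55, \n"
    "C-3,56.95,1889.5\n"
)

# Option 1: tell read_csv that a lone space means "missing".
df_na = pd.read_csv(io.StringIO(sample_csv), na_values=[" "])

# Option 2: read as-is, then convert; unparseable values become NaN.
df_raw = pd.read_csv(io.StringIO(sample_csv))
total_charges = pd.to_numeric(df_raw["TotalCharges"], errors="coerce")

print(df_na["TotalCharges"].dtype)
print(total_charges.isna().sum())
```

Both options keep the default C parser, whereas a regex separator requires engine='python'.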

NB: The rows containing space characters were found this way:

df = pd.read_csv("/mnt/ramdisk/WA_Fn-UseC_-Telco-Customer-Churn.csv")
for i in range(len(df)):
    try:
        _ = float(df["TotalCharges"][i])
    except ValueError:
        print(f'float() error: row={i}, val="{df.TotalCharges[i]}"')

# result
float() error: row=488, val=" "
float() error: row=753, val=" "
float() error: row=936, val=" "
float() error: row=1082, val=" "
float() error: row=1340, val=" "
float() error: row=3331, val=" "
float() error: row=3826, val=" "
float() error: row=4380, val=" "
float() error: row=5218, val=" "
float() error: row=6670, val=" "
float() error: row=6754, val=" "
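
The same rows can probably be located without an explicit loop. This is a sketch on a made-up four-row sample, not the real file; note that it flags any value that fails numeric conversion, including genuinely missing ones, not only single spaces.

```python
import io
import pandas as pd

# Hypothetical sample; rows 1 and 3 hold a single space in TotalCharges.
sample_csv = (
    "customerID,TotalCharges\n"
    "A-1,29.85\n"
    "B-2, \n"
    "C-3,1889.5\n"
    "D-4, \n"
)
df = pd.read_csv(io.StringIO(sample_csv))

# errors="coerce" turns anything that is not a number into NaN,
# so isna() marks the offending rows.
bad = pd.to_numeric(df["TotalCharges"], errors="coerce").isna()
bad_rows = df.index[bad].tolist()
print(bad_rows)
```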

At the same time, R also decides to encode text as categorical variables (factors) internally, while pandas does not. (This is the default in R versions before 4.0, where stringsAsFactors defaulted to TRUE.) Generally speaking, R tries to be a little "smarter" for the potential convenience of data analysts, as R is designed for statistical/analytical purposes. This may or may not cause you trouble. In contrast, pandas is more general-purpose, so it makes fewer assumptions for the sake of consistency. So it is just a different design philosophy, and any view on which is better would be purely opinion-based, even if it came from the creators of the functions themselves.
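
If the factor-like encoding is wanted in pandas, it has to be requested explicitly. A minimal sketch on a made-up column (the data below is illustrative, not taken from the file):

```python
import pandas as pd

# Hypothetical text column, similar to "gender" in the data set.
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})

# Opt in to a categorical dtype, roughly what R's factor provides.
df["gender"] = df["gender"].astype("category")

levels = list(df["gender"].cat.categories)  # the distinct labels, sorted
codes = df["gender"].cat.codes.tolist()     # integer codes, 0-based

print(levels)
print(codes)
```

Unlike R's factor codes, which start at 1, pandas category codes start at 0.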
