A csv column was read as "num" in R but "object" in pandas.read_csv()
Dataset link: https://www.kaggle.com/blastchar/telco-customer-churn
Why do R and pandas read the column "TotalCharges" with different data types? The column should be numeric in pandas, not object.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
ch_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
ch_data.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID 7043 non-null object
gender 7043 non-null object
SeniorCitizen 7043 non-null int64
Partner 7043 non-null object
Dependents 7043 non-null object
tenure 7043 non-null int64
PhoneService 7043 non-null object
MultipleLines 7043 non-null object
InternetService 7043 non-null object
OnlineSecurity 7043 non-null object
OnlineBackup 7043 non-null object
DeviceProtection 7043 non-null object
TechSupport 7043 non-null object
StreamingTV 7043 non-null object
StreamingMovies 7043 non-null object
Contract 7043 non-null object
PaperlessBilling 7043 non-null object
PaymentMethod 7043 non-null object
MonthlyCharges 7043 non-null float64
TotalCharges 7043 non-null object
Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
The data type of "TotalCharges" is object.
gg<-read.csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
str(gg)
Result:
'data.frame': 7043 obs. of 21 variables:
$ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
$ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
$ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
$ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
$ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
$ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
$ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
$ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
$ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
$ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
$ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
$ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
$ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
$ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
$ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
$ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
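The data type of TotalCharges is num.
This is caused by different strategies for handling whitespace characters: a few rows hold a single space in TotalCharges, which R treats as a missing numeric value, while pandas keeps it as text and so reads the whole column as object. You can use a regular-expression separator in pd.read_csv(sep=) to "eat" the fields that contain only whitespace (a regex separator requires engine='python'):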
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=r"\,\s*", engine='python')
df.dtypes
Out[19]:
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges float64 <- correct
Churn object
# space -> nan
df["TotalCharges"][488]
Out[23]: nan
You can see that TotalCharges is now read correctly.
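As an alternative to the regex separator, the column can be converted after a plain read. The following is only a minimal sketch: the three-row CSV is a made-up stand-in for the Telco file, not real data, and it assumes the only non-numeric values are blank fields.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Telco CSV; row "B" has a single
# space in TotalCharges, like the problem rows in the real file.
sample = io.StringIO(
    "customerID,MonthlyCharges,TotalCharges,Churn\n"
    "A,29.85,29.85,No\n"
    "B,52.55, ,No\n"
    "C,56.95,1889.5,Yes\n"
)

df = pd.read_csv(sample)

# errors="coerce" turns anything that cannot be parsed as a number
# (here, the lone space) into NaN instead of raising.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

print(df.dtypes)
```

This keeps the default C parser and leaves the other columns untouched, at the cost of silently turning any other malformed value into NaN as well.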
Note that the rows containing the space character were found this way:
df = pd.read_csv("/mnt/ramdisk/WA_Fn-UseC_-Telco-Customer-Churn.csv")
for i in range(len(df)):
    try:
        _ = float(df["TotalCharges"][i])
    except ValueError:
        print(f'float() error: row={i}, val="{df.TotalCharges[i]}"')
# result
float() error: row=488, val=" "
float() error: row=753, val=" "
float() error: row=936, val=" "
float() error: row=1082, val=" "
float() error: row=1340, val=" "
float() error: row=3331, val=" "
float() error: row=3826, val=" "
float() error: row=4380, val=" "
float() error: row=5218, val=" "
float() error: row=6670, val=" "
float() error: row=6754, val=" "
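The same search can be done without an explicit loop. This is a rough sketch on a made-up four-row frame (the values are illustrative, not taken from the dataset); `find_unparseable` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def find_unparseable(col: pd.Series) -> pd.Series:
    """Return the entries of `col` that are present but not numeric."""
    converted = pd.to_numeric(col, errors="coerce")
    # NaN after conversion, but not missing before it: a parse failure.
    return col[converted.isna() & col.notna()]

# Illustrative data: index 1 holds a lone space, like the rows above.
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "108.15"]})

bad = find_unparseable(df["TotalCharges"])
print(bad.index.tolist())
```

On the real file this should list the same row numbers as the loop, assuming the blank fields are the only values that float() rejects.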
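At the same time, R also decides to encode text internally as categorical variables (factors), while pandas does not; this is the stringsAsFactors = TRUE default of read.csv in R versions before 4.0.0. In general, R tries to be a bit "smarter" for the potential convenience of data analysts, since R was designed for statistics and analysis. This may or may not cause you trouble. pandas, by contrast, is more general-purpose, so it makes fewer assumptions for the sake of consistency. So this is simply a different choice of design philosophy, and any view on it is ultimately a matter of opinion, even when it comes from the creators of the functions themselves.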
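If the factor-like encoding is wanted in pandas, it has to be requested explicitly. A small sketch, using a made-up frame with two of the dataset's column names (the rows are illustrative only):

```python
import pandas as pd

# Illustrative rows; only the column names come from the dataset.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Churn": ["No", "No", "Yes"],
})

# Opt in to categorical storage column by column, which is roughly
# what R's factor conversion did automatically.
for col in ["gender", "Churn"]:
    df[col] = df[col].astype("category")

print(df["gender"].cat.categories.tolist())  # the levels
print(df["gender"].cat.codes.tolist())       # integer codes, 0-based
```

Unlike R's factor codes, which start at 1, the pandas codes start at 0.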