
What is R doing to a CSV when converting to a numeric data.frame or data.matrix?

I have a CSV file that ships with the scikit-learn library. Before building any predictive models in Python, I would like to look at the correlation of every attribute with the key attribute. So I imported the CSV file like so:

 y <- read.csv("boston_house_prices.csv")

Now I can't seem to compute any descriptive statistics or run cor(y[,1:13],y[,14]). R reports that 'x' is not numeric. I have tried:

 y <- as.data.frame(sapply(y, as.numeric))

and

 y <- data.matrix(y)

Now the data is numeric and I can run the correlation. However, if I want to run basic statistics, everything is skewed by the "transformation" that occurred. Can someone tell me how to preserve the numeric type native to my data while still being able to run cor()? Why does R have to transform the double/decimal values to integers to operate?

Thanks.

You can avoid this issue by passing skip = 1 when reading the data with read.csv. I grabbed a few lines from the raw data and it seems to work okay.

The first line of the file is unnecessary, and it pushes the header line down into the first data row, which in turn makes read.csv convert the columns to factors. When you use as.numeric, you are replacing each factor value with its internal integer level code, which is not the same as the original number. This is the "skew" you describe.
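To make the factor behaviour concrete, here is a minimal sketch using a made-up vector (the values are illustrative, not read from the file). It assumes the column arrived as a factor, which is the read.csv default only in R versions before 4.0.0; in later versions such columns come back as character instead.

```r
# Hypothetical values standing in for one column that was read as text:
# the misplaced header string plus three numbers stored as labels
f <- factor(c("CRIM", "0.00632", "0.02731", "0.02729"))

# as.numeric() on a factor returns the integer level codes,
# i.e. each label's position among the sorted levels
codes <- as.numeric(f)
print(codes)

# Converting to character first recovers the numbers the labels spell out;
# the stray header string cannot be parsed and becomes NA
recovered <- suppressWarnings(as.numeric(as.character(f)))
print(recovered)
```

The level codes are whole numbers between 1 and the number of distinct values, which is why the decimals appear to have turned into integers.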

txt <- '506,13,,,,,,,,,,,,
  "CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
  0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
  0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
  0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
  0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'

Your current call produces factors:

sapply(read.csv(text = txt), class)
#     X506      X13        X      X.1      X.2      X.3      X.4 
# "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
#      X.5      X.6      X.7      X.8      X.9     X.10     X.11 
# "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
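One way to spot this kind of problem before parsing is to look at the first raw lines of the file. The sketch below does that on a shortened, hypothetical excerpt in the same layout; for the real file, readLines("boston_house_prices.csv", n = 2) should show the same thing, assuming the file begins with a count line as in the sample above.

```r
# Hypothetical shortened excerpt: a count line, the header, one data row
txt <- '506,13,,
"CRIM","ZN","INDUS","MEDV"
0.00632,18,2.31,24'

# Read the first two lines as plain text, without any parsing
con <- textConnection(txt)
raw_lines <- readLines(con, n = 2)
close(con)

# Line 1 holds counts; the real column names only appear on line 2
print(raw_lines)
```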

skip = 1 seems to do the trick, as it produces numeric columns:

sapply(read.csv(text = txt, skip = 1), class)
#      CRIM        ZN     INDUS      CHAS       NOX        RM       AGE 
# "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric" 
#       DIS       RAD       TAX   PTRATIO         B     LSTAT      MEDV 
# "numeric" "integer" "integer" "numeric" "numeric" "numeric" "numeric" 

So if you change your first line to

y <- read.csv("boston_house_prices.csv", skip = 1)

everything should be fine after that, with no other conversion necessary.
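As a quick check, the sketch below reads a four-row sample in the same layout and confirms that the columns are numeric before computing statistics. It only correlates three columns, because in such a tiny sample CHAS is constant and cor would return NA for it; on the full file, cor(y[, 1:13], y[, 14]) from the question should work as intended.

```r
# Four-row sample in the same layout as the file (first line is the count line)
txt <- '506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'

# Skip the count line so the header is read as column names
y <- read.csv(text = txt, skip = 1)

# Every column should now be numeric or integer, with no conversion needed
stopifnot(all(sapply(y, is.numeric)))

# Basic statistics now reflect the original values
print(summary(y$MEDV))

# Correlation of a few attributes with MEDV
print(cor(y[, c("CRIM", "RM", "LSTAT")], y$MEDV))
```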
