
What is R doing to a CSV when converting it to a numeric data.frame or data.matrix?

I have a CSV file that ships with the scikit-learn library. Before building any predictive models in Python, I would like to look at the correlation of every attribute with the key attribute. So, I imported the CSV file like so:

 y <- read.csv("boston_house_prices.csv")

Now I can't seem to compute any descriptive statistics or run cor(y[,1:13],y[,14]). R says that 'x' is not numeric. I have tried:

 y <- as.data.frame(sapply(y, as.numeric))

and

 y <- data.matrix(y)

Now the data is numeric and I can run the correlation. However, when I run basic statistics, everything is skewed by the "transformation" that occurred. Can someone tell me how to preserve the numeric values native to my data while still being able to run cor()? Why does R have to transform the double/decimal values to integers to operate?

Thanks.

You can avoid this issue by using skip = 1 when reading the data with read.csv. I grabbed a few lines from the raw data and it seems to work okay.

The first line of the file is unnecessary. read.csv takes it as the header, which pushes the real header line down into the first data row. Every column then contains a text value, so each one is read as a factor. When you use as.numeric on a factor, you get its internal integer level codes, which are not the same as the original numeric values. This is the "skew" you describe.

txt <- '506,13,,,,,,,,,,,,
  "CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
  0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
  0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
  0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
  0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'

Your current call produces factors:

sapply(read.csv(text = txt), class)
#     X506      X13        X      X.1      X.2      X.3      X.4 
# "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
#      X.5      X.6      X.7      X.8      X.9     X.10     X.11 
# "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
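
To see why as.numeric distorts such columns, here is a minimal sketch. The vector f is made up for illustration from a few values in the sample above, and factor() is called explicitly so the result does not depend on the stringsAsFactors default of your R version:

```r
# A column that mixes a stray header string with numbers, stored as a factor
f <- factor(c("0.00632", "0.02731", "CRIM", "0.02729"))

# as.numeric() on a factor returns the integer level codes,
# i.e. each value's position among the sorted levels
codes <- as.numeric(f)

# Going through as.character() first recovers the real numbers;
# the non-numeric "CRIM" entry becomes NA (R would warn about it)
values <- suppressWarnings(as.numeric(as.character(f)))

print(levels(f))
print(codes)
print(values)
```

The level codes are small integers, which matches the "doubles turned into integers" effect described in the question. Note that in R 4.0.0 and later read.csv defaults to stringsAsFactors = FALSE, so the same columns would come back as character rather than factor.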

skip = 1 seems to do the trick, as it produces numeric columns:

sapply(read.csv(text = txt, skip = 1), class)
#      CRIM        ZN     INDUS      CHAS       NOX        RM       AGE 
# "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric" 
#       DIS       RAD       TAX   PTRATIO         B     LSTAT      MEDV 
# "numeric" "integer" "integer" "numeric" "numeric" "numeric" "numeric" 

So if you change your first line to

y <- read.csv("boston_house_prices.csv", skip = 1)

everything should be fine after that, with no other conversion necessary.
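
As a quick check, the correlation call from the question can be tried on the sample rows. This is only a sketch on four rows taken from the text above (the object names sample_txt and r are mine), so the coefficients themselves mean little:

```r
# Same sample as above: metadata line, real header, four data rows
sample_txt <- '506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'

# skip = 1 drops the metadata line so the real header is used
y <- read.csv(text = sample_txt, skip = 1)

# Every column is already numeric or integer, so no conversion is needed
stopifnot(all(sapply(y, is.numeric)))

# Correlation of the 13 attributes with MEDV. CHAS is constant in this
# tiny sample, so its entry is NA and R warns about a zero standard deviation.
r <- suppressWarnings(cor(y[, 1:13], y[, 14]))
print(r)
```

On the full file the same two calls, read.csv(..., skip = 1) followed by cor(), should work without the as.numeric or data.matrix step.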

