What is R doing to a CSV when converting it to a numeric data.frame or data.matrix?

Question

I have a CSV file that ships with the scikit-learn library. Before building any predictive models in Python, I would like to look at the correlation of every attribute with the key attribute. So I imported the CSV file into R like so:

 y <- read.csv("boston_house_prices.csv")

Now I can't seem to compute any descriptive statistics or run cor(y[,1:13],y[,14]). R says that 'x' is not numeric. I have tried:

 y <- as.data.frame(sapply(y, as.numeric))

and

 y <- data.matrix(y)

Now the data is numeric and I can run the correlation. However, when I run basic statistics, everything is skewed by the "transformation" that occurred. Can someone tell me how to preserve the numeric values native to my data while still being able to run cor()? Why does R have to transform the double/decimal values to integers in order to operate?

Thanks.

Answer 1

You can avoid this issue by passing skip = 1 to read.csv when reading the data. I grabbed a few lines from the raw data, and it seems to work okay.

The first line of the file is unnecessary. read.csv treats it as the header, which pushes the real header line down into the first data row. Because that row contains text, every column is read as a factor. When you then use as.numeric, you get each factor's internal integer level codes, which are not the original numeric values. This is the "skew" you describe.
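To make the factor behaviour concrete, here is a minimal sketch with a hypothetical four-element column (the values are borrowed from the sample data; the vector itself is only an illustration). It assumes the column was read as a factor because a header string ended up among the numbers:

```r
# Hypothetical column: a stray header string mixed in with numeric text,
# as happens when the real header is read as a data row.
f <- factor(c("CRIM", "0.00632", "0.02731", "0.02729"))

# as.numeric() on a factor returns the internal level codes (small integers),
# not the numbers that the labels spell out.
codes <- as.numeric(f)

# Converting through character first parses the labels themselves;
# the non-numeric label "CRIM" becomes NA (warning suppressed here).
recovered <- suppressWarnings(as.numeric(as.character(f)))

print(codes)
print(recovered)
```

Going through as.character recovers the values, but it is only a workaround. Reading the file correctly in the first place avoids the conversion entirely.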

txt <- '506,13,,,,,,,,,,,,
  "CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
  0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
  0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
  0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
  0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'

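Your current call produces factors (this is the behaviour in R before 4.0.0, where stringsAsFactors defaulted to TRUE; newer versions return character columns instead):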

sapply(read.csv(text = txt), class)
#     X506      X13        X      X.1      X.2      X.3      X.4 
# "factor" "factor" "factor" "factor" "factor" "factor" "factor" 
#      X.5      X.6      X.7      X.8      X.9     X.10     X.11 
# "factor" "factor" "factor" "factor" "factor" "factor" "factor"

skip = 1 seems to do the trick, as it produces numeric columns:

sapply(read.csv(text = txt, skip = 1), class)
#      CRIM        ZN     INDUS      CHAS       NOX        RM       AGE 
# "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric" 
#       DIS       RAD       TAX   PTRATIO         B     LSTAT      MEDV 
# "numeric" "integer" "integer" "numeric" "numeric" "numeric" "numeric"

So if you change your first line to

y <- read.csv("boston_house_prices.csv", skip = 1)

everything should be fine after that, with no other conversion necessary.
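As a quick check, the following sketch reads the same four sample rows with skip = 1 and runs a correlation directly. It uses inline text instead of the real file, and the choice of three predictor columns is arbitrary (with only four rows, a constant column such as CHAS would give NA):

```r
# Four data rows from the sample; the first line is the extra
# non-header line that skip = 1 drops.
txt <- '506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'

# Skip the first line so the real header supplies the column names.
y <- read.csv(text = txt, skip = 1)

# Every column is now numeric or integer, so cor() works without conversion.
r <- cor(y[, c("CRIM", "RM", "LSTAT")], y$MEDV)
print(r)
```

On the full file, the original call cor(y[,1:13], y[,14]) should work the same way once the data is read with skip = 1.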

solution1
0 ACCEPTED 2014-09-28 03:58:17