I have a CSV file that ships with the scikit-learn library. Before building any predictive models in Python, I would like to look at the correlation of every attribute with the key attribute. So, I imported the CSV file into R like so:
y <- read.csv("boston_house_prices.csv")
Now, I can't seem to compute any descriptive statistics or run cor(y[,1:13],y[,14]). It says that 'x' is not numeric. I have tried:
y <- as.data.frame(sapply(y, as.numeric))
and
y <- data.matrix(y)
Now the data is numeric and I can run the correlation. However, if I want to run basic statistics, everything is skewed by the "transformation" that occurred. Can someone tell me how to preserve the numeric type native to my data while still being able to run cor()? Why does R have to transform the double/decimal values to integers to operate?
Thanks.
You can avoid this issue by using skip = 1 when reading the data with read.csv. I grabbed a few lines from the raw data and it seems to work okay.
The first line of the file is unnecessary, and it pushes the real header line down into the first data row. Because every column then contains text, the columns are converted to factors upon reading (in R 4.0.0 and later, where stringsAsFactors defaults to FALSE, they are read as character instead). When you use as.numeric on a factor, you are actually replacing each value with its internal integer level code, which is not the same as the original numeric value and is likely incorrect. This is the "skew" you describe.
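The level-code behaviour can be seen on a tiny made-up factor. This is only a sketch: the three values below are hypothetical and are not taken from the file.

```r
# Hypothetical three-value factor, standing in for a column that was read as text.
f <- factor(c("30", "7", "15"))

# Levels are the sorted unique strings, so the order is textual, not numeric.
lv <- levels(f)

# as.numeric() on a factor returns each value's position in levels(f),
# not the number that the label spells out.
codes <- as.numeric(f)

# Converting to character first recovers the numbers the labels represent.
values <- as.numeric(as.character(f))

print(lv)
print(codes)
print(values)
```

If a data frame has already been read this way, going through as.character first recovers the values, although rereading the file correctly is simpler.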
txt <- '506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'
Your current call produces factors:
sapply(read.csv(text = txt), class)
# X506 X13 X X.1 X.2 X.3 X.4
# "factor" "factor" "factor" "factor" "factor" "factor" "factor"
# X.5 X.6 X.7 X.8 X.9 X.10 X.11
# "factor" "factor" "factor" "factor" "factor" "factor" "factor"
skip = 1 seems to do the trick, as it produces numeric columns:
sapply(read.csv(text = txt, skip = 1), class)
# CRIM ZN INDUS CHAS NOX RM AGE
# "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric"
# DIS RAD TAX PTRATIO B LSTAT MEDV
# "numeric" "integer" "integer" "numeric" "numeric" "numeric" "numeric"
So if you change your first line to
y <- read.csv("boston_house_prices.csv", skip = 1)
everything should be fine after that, with no other conversion necessary.
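Putting it together, here is a minimal sketch of the whole workflow on the four sample rows from above. It uses read.csv(text = ...) so that it runs without the file; with the real file you would pass the file name instead. The type check and the suppressWarnings call are my own additions for this small sample.

```r
# A few rows copied from the sample above, including the extra first line.
txt <- '506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'

# skip = 1 drops the "506,13,..." line so the real header row is used.
y <- read.csv(text = txt, skip = 1)

# Every column should now be numeric or integer; stop early otherwise.
stopifnot(all(sapply(y, is.numeric)))

# Correlation of each of the 13 predictors with MEDV (column 14).
# CHAS is constant in this four-row sample, so its correlation is NA and
# cor() warns about a zero standard deviation; the warning is silenced here.
r <- suppressWarnings(cor(y[, 1:13], y[, 14]))
print(r)

# Summary statistics are computed on the original values.
print(summary(y$MEDV))
```

On the full data set CHAS is not expected to be constant, so the suppressWarnings wrapper should not be needed there.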