
R: Trouble getting a correlation coefficient

I'm having difficulty getting a correlation coefficient for my data set. I started with ggpairs and then tried the cor function.

It may sound like a lack of knowledge, but I didn't realize that I can't calculate a correlation for columns whose type is not numeric. For example, I would like to know the correlation between AGE and CITY. What alternatives do I have in situations like this? Or what data transformations should I apply?

Thank you.

As thelatemail put it, sometimes graphs speak more than a stat...

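# Simulate 200 people, each with a random city and a normally distributed age
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(city = sample(cities, size = 200, replace = TRUE),
                  age = rnorm(n = 200, mean = 40, sd = 20))
dat$city <- as.factor(dat$city)
# With a factor on the right-hand side, plot() draws one boxplot of age per city
plot(age ~ city, data = dat)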

For a proper analysis you then have several options: ANOVA, or a regression with city as an explanatory (factor) variable. Your question might also get better answers on Cross Validated!

By the way, please ignore the negative ages; the example data were generated quickly.
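
As a rough sketch of the ANOVA option, assuming the same kind of simulated data as in the plot example (the seed and the name fit_aov are my own additions, not part of the original answer), something like this could be used:

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Simulated data in the same spirit as the boxplot example
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(
  city = factor(sample(cities, size = 200, replace = TRUE)),
  age  = rnorm(n = 200, mean = 40, sd = 20)
)

# One-way ANOVA: does mean age differ between cities?
fit_aov <- aov(age ~ city, data = dat)
aov_table <- summary(fit_aov)[[1]]  # data frame with Df, Sum Sq, F value, Pr(>F)
print(aov_table)
```

Since age is drawn independently of city here, a large F value would only arise by chance; with real data, the F test indicates whether the city means differ by more than the within-city noise would suggest.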


I think you first need to answer the question of what it is you are trying to do. The correlation coefficient (Pearson's r) is a specific statistic that can be calculated on two numeric variables (a dichotomous variable can be treated as numeric). It has some special characteristics, including that it is bounded by -1 and 1 and that it has no concept of a dependent or independent variable. It also does not represent the proportion of variance explained; you need to square it to get the usual measure of that. What it does give you is an estimate of the size and direction of the association between two variables.
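
To make these properties concrete, here is a small sketch on made-up numeric data (x, y and the seed are purely illustrative assumptions). It checks that r is symmetric in its two arguments and that its square matches the R squared of a simple regression:

```r
set.seed(1)  # assumption: fixed seed for reproducibility

# Two made-up numeric variables with a positive linear association
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

r         <- cor(x, y)  # Pearson's r is the default method
r_swapped <- cor(y, x)  # no dependent/independent variable: same value

# Squaring r gives the proportion of variance explained in a simple regression
r2_from_lm <- summary(lm(y ~ x))$r.squared

all.equal(r, r_swapped)
all.equal(r^2, r2_from_lm)
```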

These characteristics make r inappropriate when one of the two variables is something like city. If you want to know the proportion of variance in age explained by city, you can run a regression of age on a set of dummy variables for city and look at the overall R squared for the model. However, unlike r, R squared has no single direction (only a direction for each city's coefficient), and it won't necessarily be the same as what you would get from a model predicting city based on age.
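
A minimal sketch of that regression, assuming simulated data with hypothetical columns city and age (none of these names or values come from the question's real data set), might look like this:

```r
set.seed(123)  # assumption: fixed seed for reproducibility

cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(
  city = factor(sample(cities, size = 200, replace = TRUE)),
  age  = rnorm(n = 200, mean = 40, sd = 20)
)

# lm() dummy-codes the factor automatically; model.matrix() shows the coding
dummies <- model.matrix(~ city, data = dat)
head(dummies)

fit <- lm(age ~ city, data = dat)
r2  <- summary(fit)$r.squared  # proportion of variance in age explained by city
coef(fit)                      # one coefficient (and sign) per non-reference city

# The same proportion computed from the sums of squares
ss   <- anova(fit)[["Sum Sq"]]
eta2 <- ss[1] / sum(ss)
```

The first factor level acts as the reference category, so each remaining coefficient is that city's difference in mean age from the reference, which is why there is a direction per city rather than one overall direction.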

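For qualitative data such as CITY, you can use Spearman's rank correlation, provided the categories can be put in a meaningful order. Spearman's coefficient works on ranks, so it is not meaningful for an unordered (nominal) variable.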

You can find more information about this correlation here

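It can be computed in R with the cor function:

cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))

So, for a simple example (the method name must be lowercase, and cor only accepts numeric input, so CITY is assumed here to be an ordered factor converted to its integer codes):

cor(AGE, as.numeric(CITY), method = "spearman")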
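
As an illustration of when Spearman's correlation applies, here is a sketch using a hypothetical ordinal variable, education level, in place of CITY (the variable, its levels, and the simulated ages are all assumptions made for the example):

```r
set.seed(7)  # assumption: fixed seed for reproducibility

# Hypothetical ordinal variable whose levels have a natural order
levels_edu <- c("primary", "secondary", "bachelor", "master")
edu <- factor(sample(levels_edu, size = 100, replace = TRUE),
              levels = levels_edu, ordered = TRUE)
age <- rnorm(100, mean = 40, sd = 10)

# as.numeric() on an ordered factor returns the level codes 1, 2, 3, 4
rho <- cor(age, as.numeric(edu), method = "spearman")

# Spearman's rho is Pearson's r computed on the ranks (ties get average ranks)
rho_ranks <- cor(rank(age), rank(as.numeric(edu)))
all.equal(rho, rho_ranks)
```

For a factor without a natural order, the integer codes just follow the alphabetical order of the levels, so the resulting coefficient would change if the levels were relabelled.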

I hope that helps you
