
R: Trouble getting correlation coefficient

I'm having difficulty getting a correlation coefficient for my data set. I started with ggpairs and then tried the cor function.

This may just be a gap in my knowledge, but I didn't realize that I can't calculate a correlation for columns whose type is not numeric. For example, I would like to know the correlation between AGE and CITY. What alternatives do I have in situations like this? Or what data transformations should I apply?

Thank you.

As thelatemail put it, sometimes a graph says more than a statistic...

cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(city = sample(cities,size = 200, replace = TRUE), age = rnorm(n = 200, mean = 40, sd = 20))
dat$city <- as.factor(dat$city)
plot(age ~ city, data = dat)

Then for a proper analysis you have several options: ANOVA, or regression with city as an explanatory variable (factor). Your question might also get better responses on Cross Validated!
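As a rough sketch of the two options just mentioned, the following fits a one-way ANOVA and the equivalent regression on simulated data like that in the plot example. The seed and the object names fit_aov and fit_lm are assumptions made for illustration only.

```r
# Simulated data in the spirit of the example above (the seed is an assumption)
set.seed(42)
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(
  city = factor(sample(cities, size = 200, replace = TRUE)),
  age  = rnorm(n = 200, mean = 40, sd = 20)
)

# One-way ANOVA: does mean age differ between cities?
fit_aov <- aov(age ~ city, data = dat)
summary(fit_aov)

# The same model as a regression; city is expanded into dummy variables
fit_lm <- lm(age ~ city, data = dat)
# Intercept = mean age in the reference city (first factor level);
# the other coefficients are differences from that reference
coef(fit_lm)
```

Because the ages here are drawn independently of city, the F test should usually come out non-significant; with real data the same two calls apply unchanged.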

By the way, please ignore the negative ages; this example was put together quickly.


I think you first need to answer the question of what it is you are trying to do. The correlation coefficient (Pearson's r) is a specific statistic that can be calculated on two numeric variables (where a dichotomous variable can be considered numeric). It has some special characteristics, including that it is bounded by -1 and 1 and that it has no concept of a dependent or independent variable. Also, it does not represent the proportion of variance explained; you need to square it to get the usual measure of that. What it does do is give you an estimate of the size and direction of the association between two variables.
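These properties can be checked on made-up numbers. The sketch below uses simulated variables x, y and group, which are purely illustrative and not from the question.

```r
# Simulated numeric variables (illustrative assumption)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

r         <- cor(x, y)   # Pearson's r is the default method of cor()
r_swapped <- cor(y, x)   # same value: there is no dependent/independent variable

# Squaring r gives the R-squared of a simple linear regression of y on x
r2_lm <- summary(lm(y ~ x))$r.squared
all.equal(r^2, r2_lm)

# A dichotomous variable coded 0/1 can be treated as numeric
group    <- rbinom(100, size = 1, prob = 0.5)
r_binary <- cor(group, y)
```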

These characteristics make it inappropriate to use r when one of the two variables is something like city. If you want to know the proportion of variance in age explained by city, you can run a regression of age on a set of dummy variables for city and look at the overall R squared for the model. However, unlike r, you won't have a single direction (just a direction for each city), and the result won't necessarily be the same as if you built a model predicting city from age.
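A minimal sketch of that regression, assuming a data frame dat with a factor city and a numeric age (simulated here; the names and seed are illustrative assumptions):

```r
# Simulated data (assumption): age is drawn independently of city
set.seed(42)
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
dat <- data.frame(
  city = factor(sample(cities, size = 200, replace = TRUE)),
  age  = rnorm(n = 200, mean = 40, sd = 20)
)

# lm() builds the dummy variables for the factor automatically
fit <- lm(age ~ city, data = dat)
r2  <- summary(fit)$r.squared   # proportion of variance in age explained by city

# Equivalent by hand: between-city sum of squares over total sum of squares
ss   <- anova(fit)[["Sum Sq"]]  # first entry = city, second = residuals
eta2 <- ss[1] / sum(ss)

# One "direction" per city: each coefficient is relative to the reference city
coef(fit)
```

With independent simulated data the R squared should be close to zero; the point is only where to read it off.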

For qualitative data such as City, you can use Spearman's rank correlation, provided the categories have a meaningful order.

You can find more information about this correlation here.

It can be used in R with this command:

cor(x, use = , method = )

So, if you want to use it in a simple example:

# method names are lowercase, and cor() needs numeric input, so the factor is converted to integer codes
cor(AGE, as.numeric(factor(CITY)), method = "spearman")
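One caveat worth illustrating: cor() only accepts numeric input, so a city column has to be turned into integer codes first, and for an unordered variable those codes depend on an arbitrary level order. The sketch below uses simulated AGE and CITY vectors and a made-up alternative ordering, all of which are assumptions for illustration.

```r
# Simulated vectors (illustrative assumption)
set.seed(7)
cities <- c("Montreal", "Toronto", "New York", "Plattsburgh")
CITY <- sample(cities, size = 200, replace = TRUE)
AGE  <- rnorm(200, mean = 40, sd = 20)

# Default factor levels are alphabetical
codes_alpha <- as.numeric(factor(CITY))

# An equally arbitrary alternative ordering of the same cities
alt_levels  <- c("Toronto", "Plattsburgh", "Montreal", "New York")
codes_other <- as.numeric(factor(CITY, levels = alt_levels))

rho_alpha <- cor(AGE, codes_alpha, method = "spearman")
rho_other <- cor(AGE, codes_other, method = "spearman")

# Same data, different codings: the two values will generally differ
c(alphabetical = rho_alpha, reordered = rho_other)
```

If the two results differ, that is a sign the coefficient reflects the coding rather than the data, which is why the rank-based approach is best reserved for ordered categories.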

I hope that helps.
