How to ask if correlation between two binary variables varies between groups in R

This seems like a simple coding/statistics problem, but I've been working on it and reading about it for days, and I just cannot wrap my head around it. I am a biologist, not a statistician, and any help would be appreciated.

I am trying to find a way to ask whether the degree of correlation or relationship between two binary variables (impacts present/absent ~ threats present/absent) varies between species and, separately, between categories. Suggestions on better approaches, packages, or ways to code what I'm after would be appreciated, as well as general input on what I've already done. My dataframe looks like this:

set.seed(123)
df <- data.frame(
  Species      = rep(c("plant1", "plant2", "plant3", "plant4", "plant5"), each = 5),
  Category     = rep(c("A", "B", "C", "D", "E"), 5),
  threat.count = sample(0:1, replace = TRUE, size = 25),
  impact.count = sample(0:1, replace = TRUE, size = 25)
)

I can ask what the general correlation between threats and impacts is with a non-parametric Spearman test for correlation between paired samples:

cor.test(df$threat.count, df$impact.count, method = "spearman",
         exact = FALSE)
# rho = -0.1666667; in my real data the correlation is much
# higher, around 0.26.

I would interpret this as: overall, the correlation between threats and impacts is about 0.26.
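A side note on the coefficient itself: for two 0/1 variables, Spearman's rho coincides with Pearson's r and with the phi coefficient of the 2x2 table, so a value of 0.26 is a coefficient on a scale from -1 to 1 rather than a percentage. The sketch below checks this on simulated vectors; the names `threat` and `impact` and the simulation settings are assumptions for illustration, not the asker's data.

```r
# Hypothetical 0/1 vectors, for illustration only (not the asker's data).
set.seed(1)
threat <- rbinom(200, 1, 0.5)
impact <- rbinom(200, 1, ifelse(threat == 1, 0.65, 0.40))

rho <- cor(threat, impact, method = "spearman")
r   <- cor(threat, impact, method = "pearson")

# Phi coefficient computed directly from the 2x2 table of counts.
tab <- table(threat, impact)
phi <- (tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

all.equal(rho, r)  # Spearman and Pearson agree for binary data
all.equal(r, phi)  # and both match phi
```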

However, I would like to dig in and ask whether the degree of correlation varies between Species (and later between Categories), and if so, how (e.g., is the correlation between threats and impacts stronger for some species than for others?).

I have tried fitting both generalized linear models and generalized linear mixed models to get at this, but I am not sure whether either answers my question or whether I am interpreting them correctly.

To ask whether the degree of correlation varies between species overall, I could do something like this:

mod0 <- glm(impact.count ~ threat.count, data = df, family = 
            binomial(link = "logit"))
mod1 <- glm(impact.count ~ threat.count + Species, data = df, 
            family = binomial(link = "logit"))
anova(mod0, mod1, test = 'LRT') # here, no, but in my real data, 
                                # yes

All I can say from that would be "Yes/no, the degree of correlation between threats and impacts varies between species", but I would like to know by how much.

So, we can look at the summary from mod1:

summary(mod1)

As I understand it, in this output the coefficient estimate for threat.count is the change in the log-odds of an impact being present (impact.count = 1) for a 1-unit change in threat.count (i.e., when threat.count = 1 rather than 0). The coefficient estimates for Species2-Species5 are the differences in log-odds between each of those species and Species1 (which is absorbed into the Intercept term). I could get the "true" coefficients for all species by running a no-intercept model like this:

mod2 <- glm(impact.count ~ 0 + threat.count + Species, data = df, 
            family = binomial(link = "logit"))
summary(mod2)

The coefficient estimates for Species1:Species5 here are the log-odds that an impact will be present (impact.count = 1) for each Species if threat count is ... constant(?) 1(?) mean = 0.5(?).

I could get the odds ratios and probabilities for either version with:

exp(coef(modx))                     #get odds ratios 
exp(coef(modx))/(1+exp(coef(modx))) #get probabilities
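One caveat about the second line: `exp(b) / (1 + exp(b))` is the inverse logit (base R's `plogis()`), which turns a log-odds into a probability. It is therefore meaningful for a full linear predictor, such as the intercept alone or the intercept plus a slope, but not for a slope on its own, which is a difference in log-odds. A minimal sketch on simulated data follows; the variable names and simulation settings are assumptions.

```r
# Hypothetical data, for illustration only.
set.seed(2)
threat <- rbinom(300, 1, 0.5)
impact <- rbinom(300, 1, plogis(-0.5 + 1.0 * threat))

fit <- glm(impact ~ threat, family = binomial(link = "logit"))
b   <- coef(fit)

# Slope: exp() gives the odds ratio of impact for threat = 1 vs threat = 0.
odds_ratio <- exp(b[["threat"]])

# Probabilities come from the inverse logit of a full linear predictor.
p_no_threat <- plogis(b[["(Intercept)"]])
p_threat    <- plogis(b[["(Intercept)"]] + b[["threat"]])

# The same odds ratio, recovered from the two fitted probabilities.
or_check <- (p_threat / (1 - p_threat)) / (p_no_threat / (1 - p_no_threat))
all.equal(odds_ratio, or_check)
```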

My issue here is: how can I interpret any of these in terms of whether, and by how much, the correlation between threats and impacts varies between species?

I have also tried fitting a generalized linear mixed model:

library(lme4)
mod3 <- glmer(impact.count ~ threat.count + (1|Species), data = 
              df, family = binomial(link = "logit"))
summary(mod3)

This gives me an estimate for threat.count, but I am running into the same interpretation problem as before.

I also tried using lmList to look at the relationship between impact count and threat count separately for each species, but I am worried about whether this is a statistically sound approach (are there any multiple-comparisons issues?). Also, how would I get it to report whether the sub-models are significant?

corr.spp.list <- lme4::lmList(impact.count ~ threat.count | Species,
                              data = df,
                              family = binomial(link = "logit"),
                              warn = TRUE)  # fit each model separately by species
corr.spp.list
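On the multiple-comparisons worry, one option is to run one test per species and then adjust the p-values. The sketch below is only an illustration under stated assumptions: it swaps the per-species logistic fits for Fisher's exact test on each 2x2 table, uses simulated data with 40 rows per species, and applies a Holm correction via `p.adjust()`. The data frame `sim` and its columns are hypothetical.

```r
# Hypothetical data with 40 rows per species, for illustration only.
set.seed(3)
sim <- data.frame(
  Species = rep(paste0("plant", 1:5), each = 40),
  threat  = rbinom(200, 1, 0.5)
)
sim$impact <- rbinom(200, 1, ifelse(sim$threat == 1, 0.6, 0.4))

# One Fisher exact test per species on the 2x2 threat-by-impact table.
# factor(..., levels = 0:1) keeps the table 2x2 even if a level is absent.
p_raw <- sapply(split(sim, sim$Species), function(d) {
  tab <- table(factor(d$threat, levels = 0:1),
               factor(d$impact, levels = 0:1))
  fisher.test(tab)$p.value
})

# Holm adjustment controls the family-wise error rate across the five tests.
p_holm <- p.adjust(p_raw, method = "holm")
round(cbind(p_raw, p_holm), 3)
```

Per-species tests answer "is there an association within this species?", which is a different question from "does the association differ between species?".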

I don't think the glm approach works here. For example, in this code:

mod2 <- glm(impact.count ~ 0 + threat.count + Species, data = df, 
            family = binomial(link = "logit"))

The coefficients actually estimate the effect of Species on impact when we control for threat. They are not the correlation between threat and impact for any specific species.
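For comparison, and not as part of the approach proposed in this answer, a logistic model can let the threat effect differ by species through a `threat * Species` interaction, and a likelihood-ratio test against the additive model then asks whether the association varies between species. The sketch uses simulated data with 40 rows per species, because five rows per species are too few to estimate the interaction reliably. The data frame `sim` and its columns are assumptions.

```r
# Hypothetical data with 40 rows per species, for illustration only.
set.seed(4)
sim <- data.frame(
  Species = rep(paste0("plant", 1:5), each = 40),
  threat  = rbinom(200, 1, 0.5)
)
sim$impact <- rbinom(200, 1, ifelse(sim$threat == 1, 0.6, 0.4))

# Additive model: one threat effect shared by all species.
fit_add <- glm(impact ~ threat + Species, data = sim,
               family = binomial(link = "logit"))

# Interaction model: the threat effect is allowed to differ by species.
fit_int <- glm(impact ~ threat * Species, data = sim,
               family = binomial(link = "logit"))

# Likelihood-ratio test of the four threat:Species interaction terms.
lrt <- anova(fit_add, fit_int, test = "LRT")
lrt

# Species-specific odds ratios for threat (reference species first).
b <- coef(fit_int)
exp(b["threat"] + c(plant1 = 0, b[grep("^threat:Species", names(b))]))
```

Here the association is measured as an odds ratio per species rather than as a correlation coefficient, so the numbers are not directly comparable to rho.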

You can use the scale function to standardize the variables within each species:

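library(tidyverse)

df_z_score <- df %>%
  group_by(Species) %>%
  # scale() returns a one-column matrix; as.numeric() keeps plain vectors.
  # A species with no variation in a variable yields NaN here.
  mutate(threat_z = as.numeric(scale(threat.count)),
         impact_z = as.numeric(scale(impact.count))) %>%
  ungroup()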

Then:

summary(lm(threat_z ~ impact_z * factor(Species), data = df_z_score))

Because the regression coefficient between two standardized variables equals their correlation, the coefficient impact_z = -6.124e-01 is the correlation between impact and threat in the reference group (the first species level).
The coefficient of each interaction term is the change in the correlation coefficient relative to the reference group, and its p-value indicates whether that change is significant.
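The identity this relies on can be checked for a single group: when both variables are standardized, the simple-regression slope equals the Pearson correlation. A minimal sketch on simulated 0/1 vectors (the names and simulation settings are assumptions):

```r
# Hypothetical 0/1 vectors for a single group, for illustration only.
set.seed(5)
threat <- rbinom(100, 1, 0.5)
impact <- rbinom(100, 1, ifelse(threat == 1, 0.7, 0.3))

# scale() returns a one-column matrix; as.numeric() drops it to a vector.
threat_z <- as.numeric(scale(threat))
impact_z <- as.numeric(scale(impact))

# Slope of the standardized regression vs. the plain correlation.
slope <- coef(lm(threat_z ~ impact_z))[["impact_z"]]
all.equal(slope, cor(threat, impact))
```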
