
R - Inconsistent p-value in running Spearman correlation

My problem is that when I compute a running correlation, for some odd reason I do not get the same p-value for the same estimate (correlation) values.

My target is to calculate a running Spearman correlation on two vectors in the same data.frame (subject1 and subject2 in the example below). In addition, my window (the length of the vector) and stride (the jump/step between consecutive windows) are constant. As such, looking at the formula below (from Wikipedia), I should get the same t value, and hence the same p-value, for the same Spearman correlation. This is because n stays the same (it is the same window size) and r is the same. However, my final p-value is different.

[image: the formula referred to above, not preserved in this copy]
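A minimal sketch of the reasoning above, assuming the formula in the image is the usual t approximation for Spearman's rho, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The helper name spearman_t_pvalue is hypothetical and is not part of pspearman; it only illustrates that, with n fixed, the two-sided p-value depends on r alone.

```r
# Hypothetical helper (not from pspearman): two-sided p-value from the
# t approximation, assuming t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2.
spearman_t_pvalue <- function(rho, n) {
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 20                               # window size used in the question
p1 <- spearman_t_pvalue(0.5, n)       # some window with rho = 0.5
p2 <- spearman_t_pvalue(0.5, n)       # another window with the same rho
p3 <- spearman_t_pvalue(-0.5, n)      # same magnitude, opposite sign

# Same rho and same n give the same p-value; the sign does not matter
# for a two-sided test.
identical(p1, p2)
isTRUE(all.equal(p1, p3))
```

Under this approximation, two windows can only differ in p-value if their rho values differ, even if only in digits that are not visible on a plot.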

#Needed pkgs    
require(tidyverse)
require(pspearman)
require(gtools)

#Sample data
set.seed(528)
subject1 <- rnorm(40, mean = 85, sd = 5)

set.seed(528)
subject2 <- c(
  lag(subject1[1:21]) - 10, 
  rnorm(n = 6, mean = 85, sd = 5), 
  lag(subject1[length(subject1):28]) - 10)

df <- data.frame(subject1 = subject1, 
                 subject2 = subject2) %>% 
  rowid_to_column(var = "Time") 

df[is.na(df)] <- subject1[1] - 10

rm(subject1, subject2)

#Function for Spearman
psSpearman <- function(x, y) 
{
  out <- pspearman::spearman.test(x, y,
                                  alternative = "two.sided", 
                                  approximation = "t-distribution") %>% 
    broom::tidy()
  return(data.frame(estimate = out$estimate,
                    statistic = out$statistic,
                    p.value = out$p.value))
}

#Running correlation along the subjects
dfRunningCor <- running(df$subject1, df$subject2, 
                        fun = psSpearman,
                        width = 20,
                        allow.fewer = FALSE, 
                        by = 1,
                        pad = FALSE, 
                        align = "right") %>% 
  t() %>% 
  as.data.frame() 

#Arranging the Results into easy to handle data.frame 
Results <- do.call(rbind.data.frame, dfRunningCor) %>% 
  t() %>%
  as.data.frame() %>%
  rownames_to_column(var = "Win") %>% 
  gather(CorValue, Value, -Win) %>% 
  separate(Win, c("fromIndex", "toIndex")) %>%
  mutate(fromIndex = as.numeric(substring(fromIndex, 2)),
         toIndex = as.numeric(toIndex)) %>%
  spread(CorValue, Value) %>% 
  arrange(fromIndex) %>% 
  select(fromIndex, toIndex, estimate, statistic, p.value)
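For comparison, the window/stride logic can be sketched in base R alone. This is only an illustration under assumptions: running_spearman is a hypothetical helper (not from gtools or pspearman), it uses the t approximation rather than calling pspearman::spearman.test, and the sample data here is made up rather than the subject1/subject2 data above.

```r
# Hypothetical helper: running Spearman correlation over fixed-width windows.
# `width` is the window length, `by` is the stride between window starts.
running_spearman <- function(x, y, width = 20, by = 1) {
  starts <- seq(1, length(x) - width + 1, by = by)
  rows <- lapply(starts, function(from) {
    idx <- from:(from + width - 1)
    rho <- cor(x[idx], y[idx], method = "spearman")
    # t approximation with width - 2 degrees of freedom (an assumption here)
    t_stat <- rho * sqrt((width - 2) / (1 - rho^2))
    data.frame(fromIndex = from,
               toIndex   = from + width - 1,
               estimate  = rho,
               statistic = t_stat,
               p.value   = 2 * pt(-abs(t_stat), df = width - 2))
  })
  do.call(rbind, rows)
}

# Made-up data, only to exercise the helper
set.seed(1)
x <- rnorm(40)
y <- x + rnorm(40)
res <- running_spearman(x, y, width = 20, by = 1)
head(res)
```

With 40 observations, a width of 20 and a stride of 1, this yields 21 windows, each with its own estimate and p-value in one row.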

My problem is that when I plot the Results with the estimates (Spearman rho; estimate) against the window number (fromIndex) and color by the p-value, I should get something like a "tunnel"/"path" of the same color across the same height - but I don't. For example, in the picture below, points at the same height inside the red circle should have the same color - but they don't.

[image: plot of estimate against fromIndex colored by p.value, with the points in question circled in red]

Code for the graph:

Results %>% 
  ggplot(aes(fromIndex, estimate, color = p.value)) + 
  geom_line()

What I have found so far is that it might be due to:

1. Functions like Hmisc::rcorr() tend not to give the same p-value with small samples or many ties. This is why I use pspearman::spearman.test, which, from what I read here, is supposed to solve this problem.
2. Small sample size - I tried using a bigger sample size. I still get the same problem.
3. Rounding - I tried rounding my p-values. I still get the same problem.

Thank you for your help!

Edit.

Could it be "pseudo" coloring by ggplot? Could it be that ggplot just carries the "last" color forward until the next point? Is that why I get "light blue" from point 5 to 6 but "dark blue" from point 7 to 8?

[image]

The results you obtain for the p.value variable are coherent with the estimate values. You can check it as follows:

Results$orderestimate <- order(-abs(Results$estimate))
Results$orderp.value <- order(abs(Results$p.value))
identical(Results$orderestimate ,Results$orderp.value)
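A more direct check is to group the p-values by estimate and confirm that each distinct |estimate| maps to a single p-value. The sketch below is only an illustration: check_p_consistency is a hypothetical helper, the rounding to 10 digits is an arbitrary choice to absorb floating-point noise, and the toy data frame stands in for Results.

```r
# Hypothetical helper: TRUE if every distinct |estimate| has exactly one p-value.
# `df` needs numeric columns `estimate` and `p.value`.
check_p_consistency <- function(df, digits = 10) {
  n_p_per_estimate <- tapply(round(df$p.value, digits),
                             round(abs(df$estimate), digits),
                             function(p) length(unique(p)))
  all(n_p_per_estimate == 1)
}

# Toy stand-in for Results, using the t approximation with n = 20
n <- 20
toy <- data.frame(estimate = c(0.5, -0.5, 0.2, 0.5, 0.2))
t_stat <- toy$estimate * sqrt((n - 2) / (1 - toy$estimate^2))
toy$p.value <- 2 * pt(-abs(t_stat), df = n - 2)

consistent <- check_p_consistency(toy)

# A deliberately broken copy, to show what a failure looks like
broken <- toy
broken$p.value[1] <- broken$p.value[1] + 0.01
inconsistent <- check_p_consistency(broken)
```

Calling check_p_consistency(Results) on the real data would then show whether any estimate is paired with more than one p-value.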

I don't think you should map the p.value to colour in the graph; it is an unnecessary visual distraction and it is hard to interpret.

If I were you, I would only display the p.value and perhaps include a point to indicate the sign of the estimate variable.

p <- Results %>% 
  ggplot(aes(fromIndex,  p.value)) + 
  geom_line()

# If you want to display the sign of the estimate
Results$estimate.sign <- as.factor(sign(Results$estimate))
p + geom_point(aes(color = estimate.sign))
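On the question's edit about "pseudo" coloring: as far as I know, geom_line() with a continuous colour draws each segment in the colour of the point it starts from, so two segments at the same height can look different even when the points themselves have equal p-values. A small sketch, assuming made-up data (toy is not the Results data frame) and the t approximation with n = 20, that adds geom_point() so the colour of each individual point is visible:

```r
library(ggplot2)

# Made-up estimates; three windows share the same rho of 0.5
n <- 20
toy <- data.frame(fromIndex = 1:6,
                  estimate  = c(0.2, 0.5, 0.5, 0.3, 0.5, 0.1))
t_stat <- toy$estimate * sqrt((n - 2) / (1 - toy$estimate^2))
toy$p.value <- 2 * pt(-abs(t_stat), df = n - 2)

# Lines show segment colours; points show the colour of each window itself
p_toy <- ggplot(toy, aes(fromIndex, estimate, color = p.value)) +
  geom_line() +
  geom_point(size = 3)
p_toy
```

If the points at the same height share a colour while the segments leaving them do not, the difference comes from how the line is drawn and not from the p-values.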
