R: creating a comprehensive table of correlations between combinations of columns
Here is my dataset. I'm looking at baseball data.
structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L,
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L,
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
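I'm trying to build a multiple linear regression and decide which predictors to include. The problem is that I think some of these variables will be strongly correlated with each other. For example, one column is "base hits by batters (any type of hit)" and another is "doubles by batters", and so on. So I think that if a player hits a double, it is counted as +1 in several different columns.
I'm trying to work out which variables to include, and one strategy I thought of is to determine which of these variables are correlated with each other and how strong those correlations are. Maybe I won't include variables that are very highly correlated with one another. (Any help with this?)
I started down this path, looking at Pearson correlations one by one: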
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_2B)
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_3B)
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_HR)
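But then I saw how many combinations there are among all these variables: this data frame has 16 columns and I want to select any two, so 16!/(2!(16 - 2)!) if my math is correct. That would be 120 lines of code. It would also be easy to get tangled up and forget which ones I've already done... so not very efficient.
So my original question is: is there an efficient way in code to compare the full set of correlations between the variables in a data frame?
Then I found this great post on Stack Overflow, which I think answers my question, but I still can't get it to work.
Side note - I also tried to find out which columns have NA values, in case NA values make a difference here.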
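The count can be checked directly in R. This is a minimal sketch using base R's choose() and combn(); the names n_pairs, cols and pair_names are illustrative only, and combn() is shown on just three of the column names.

```r
# Number of unordered pairs among 16 columns: 16! / (2! * (16 - 2)!)
n_pairs <- choose(16, 2)
n_pairs

# combn() enumerates the pairs themselves; shown here on three column names
cols <- c("TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B")
pair_names <- t(combn(cols, 2))  # one row per unordered pair
pair_names
```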
any(is.na(moneyball_training_data$TARGET_WINS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_2B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_3B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
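any(is.na(moneyball_training_data$TEAM_BASERUN_SB)) # TRUE (column is TEAM_BASERUN_SB; the nonexistent TEAM_BATTING_SB always gives FALSE)
any(is.na(moneyball_training_data$TEAM_BASERUN_CS)) # TRUE (column is TEAM_BASERUN_CS; the nonexistent TEAM_BATTING_CS always gives FALSE)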
any(is.na(moneyball_training_data$TEAM_BATTING_HBP)) # TRUE
any(is.na(moneyball_training_data$TEAM_PITCHING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_FIELDING_E)) # FALSE
any(is.na(moneyball_training_data$TEAM_FIELDING_DP)) # TRUE
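(Side note - is there a more efficient way to run this any(is.na()) check?)
To continue, I'm now following another Stack Overflow answer that uses a tidyverse approach. I don't fully understand it, but the person who gave the answer seems very smart: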
library(dplyr) # select(), mutate(), filter(), %>%
library(tidyr) # gather()

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)
moneyball_training_data %>%
  select(-INDEX) %>% # remove unnecessary columns
  cor() %>% # get all correlations (even ones you don't care about)
  data.frame() %>% # save result as a dataframe
  mutate(v1 = row.names(.)) %>% # add row names as a column
  gather(v2,cor, -v1) %>% # reshape data
  filter(f(v1,v2) & v1 != v2)
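But how can the result be just a 3 x 3 data frame? I was expecting something like the layout below, where each number is the correlation of x and y, and the redundant entries are left blank.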
   1  2  3  4  5  6  7
1    12 13 14 15 16 17
2       23 24 25 26 27
3          34 35 36 37
4             45 46 47
5                56 57
6                   67
7
Are you expecting a matrix like this?
df <- structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L,
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L,
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
# install.packages("corrr")
library(corrr)
df1 <- corrr::correlate(df, method = "pearson")
df1 # print the correlation tibble
# 1. Output:
# A tibble: 17 x 18
term INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 INDEX NA 0.642 -0.820 -0.291 0.0236 0.0826 0.205
2 TARG~ 0.642 NA -0.685 0.367 -0.373 0.673 0.788
3 TEAM~ -0.820 -0.685 NA 0.192 0.496 -0.449 -0.502
4 TEAM~ -0.291 0.367 0.192 NA -0.0789 0.640 0.653
5 TEAM~ 0.0236 -0.373 0.496 -0.0789 NA -0.752 -0.676
6 TEAM~ 0.0826 0.673 -0.449 0.640 -0.752 NA 0.984
7 TEAM~ 0.205 0.788 -0.502 0.653 -0.676 0.984 NA
8 TEAM~ 0.134 0.401 -0.560 0.377 -0.754 0.864 0.799
9 TEAM~ 0.790 -0.00267 -0.690 -0.356 0.413 -0.528 -0.541
10 TEAM~ 0.874 -0.0332 -0.834 -0.598 0.261 -0.578 -0.623
11 TEAM~ NA NA NA NA NA NA NA
12 TEAM~ -0.662 -0.923 0.733 -0.358 0.448 -0.771 -0.852
13 TEAM~ -0.352 0.308 -0.127 0.661 -0.767 0.891 0.809
14 TEAM~ -0.914 -0.793 0.736 0.0225 0.0863 -0.341 -0.464
15 TEAM~ -0.667 -0.930 0.719 -0.360 0.424 -0.757 -0.842
16 TEAM~ -0.707 -0.925 0.757 -0.314 0.418 -0.733 -0.820
17 TEAM~ 0.0666 0.265 -0.144 -0.583 -0.447 -0.123 -0.150
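A possible explanation for the 3 x 3 result in the question: the filter f(v1, v2) calls grepl(v1, v2), which keeps only pairs where the name in v1 occurs inside the name in v2 (for example TEAM_BATTING_H inside TEAM_BATTING_HR), so almost every pair is dropped. The sketch below is a base R alternative, not the method of either Stack Overflow answer. It uses a five-column subset of the data above for brevity and assumes pairwise deletion of missing values is acceptable. The names cor_mat, idx and cor_pairs are illustrative.

```r
# grepl(pattern, x) is a substring/regex match, so it is TRUE only when
# the first name occurs inside the second
grepl("TEAM_BATTING_H", "TEAM_BATTING_HR")  # TRUE
grepl("TEAM_BATTING_H", "TEAM_BATTING_2B")  # FALSE

# Five columns copied from the data in the question (subset for brevity)
df <- data.frame(
  TARGET_WINS     = c(39, 70, 86, 70, 82, 75),
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279),
  TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200),
  TEAM_BATTING_HR = c(13, 190, 137, 96, 102, 92),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107)
)

# Full correlation matrix; for each pair of columns, use only the rows
# where both values are present
cor_mat <- cor(df, use = "pairwise.complete.obs")

# Positions of the strict upper triangle: each unordered pair exactly once
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)

# Long table with one row per pair
cor_pairs <- data.frame(
  v1  = rownames(cor_mat)[idx[, "row"]],
  v2  = colnames(cor_mat)[idx[, "col"]],
  cor = cor_mat[idx]
)

# Strongest relationships (positive or negative) first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$cor)), ]
print(cor_pairs)
```

With all 16 predictors this would give the 120 rows computed in the question. Filtering the result, for instance with subset(cor_pairs, abs(cor) > 0.8) where 0.8 is an arbitrary cut-off, is one way to flag candidate pairs for a closer look.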
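A quick answer to the side question buried in this post: a more efficient way to find the columns that contain NA values, rather than checking them one by one: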
moneyball_training_data %>% summarise(across(everything(), ~ any(is.na(.x))))
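The same check can be done without dplyr. This is a minimal base R sketch on a small made-up data frame; the columns x, y and z are hypothetical stand-ins for the real data.

```r
# Toy data: x has no NA, y has one, z is entirely NA
df <- data.frame(x = c(1, 2, 3), y = c(NA, 2, 3), z = c(NA, NA, NA))

# TRUE/FALSE per column: does the column contain any NA?
has_na <- sapply(df, anyNA)
has_na

# Number of NA values per column
na_counts <- colSums(is.na(df))
na_counts

# Just the names of the columns that contain NA
names(has_na)[has_na]
```

Because this loops over the columns that actually exist in the data frame, a mistyped column name cannot silently produce FALSE.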