R creating a comprehensive table of correlation between combinations of columns

Here is a look at my dataset. I'm looking at baseball data.

structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L, 
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

I'm trying to create a multiple linear regression and decide which predictor variables to include. The problem is, I think some of these variables are going to be strongly correlated with each other. For example, one of the columns is "base hits by batters (any kind of hit)" and another column is "doubles by batters" and so on. So I think if a player hits a double, it would add +1 in multiple different columns.

I'm trying to figure out which variables to include, and one strategy I have in mind is to determine which of these variables are correlated with each other and how strongly. Maybe I won't include variables that are very strongly correlated with each other. (Help on this?)

I started down this road, looking at Pearson correlations one by one:

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_2B)

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_3B)

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_HR)

But then I saw how many combinations there are between all of these variables. There are 16 columns in this dataframe (not counting INDEX) and I want to select any two: 16! / (2! (16 - 2)!). If my math is right, this would be 120 lines of code using this method, and it would be easy to get tangled and lose track of which ones I've already done. So not very efficient.
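As a quick check of that count, base R's `choose()` gives the number of unordered pairs and `combn()` enumerates them. The four column names below are just an illustrative subset of the dataframe.

```r
# Number of unordered pairs among 16 columns: 16! / (2! * 14!)
n_pairs <- choose(16, 2)
n_pairs

# combn() lists the pairs themselves; shown here for four of the columns
cols <- c("TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR")
pairs <- t(combn(cols, 2))   # one row per unordered pair
pairs
```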

So my original question was: is there an efficient way in code to compute the comprehensive set of correlations between variables in a dataframe?

I then found this amazing post on Stack Overflow that I think answers my question, but I still can't quite get it to work.

Side note: I also tried to figure out which columns had NA values, in case NA values made a difference.

any(is.na(moneyball_training_data$TARGET_WINS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_2B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_3B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_BASERUN_SB)) # TRUE
any(is.na(moneyball_training_data$TEAM_BASERUN_CS)) # TRUE
any(is.na(moneyball_training_data$TEAM_BATTING_HBP)) # TRUE
any(is.na(moneyball_training_data$TEAM_PITCHING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_FIELDING_E)) # FALSE
any(is.na(moneyball_training_data$TEAM_FIELDING_DP)) # TRUE

(Side note: is there a more efficient way to write this any(is.na()) code?)

To continue, I followed the direction of the other Stack Overflow answer, the tidy method, which I don't fully understand, but the person who gave the answer seemed smart:

library(dplyr)  # %>%, select(), mutate(), filter()
library(tidyr)  # gather()

# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)

moneyball_training_data %>% 
  select(-INDEX) %>%                # remove unnecessary columns
  cor() %>%                      # get all correlations (even ones you don't care about)
  data.frame() %>%               # save result as a dataframe
  mutate(v1 = row.names(.)) %>%  # add row names as a column
  gather(v2,cor, -v1) %>%        # reshape data
  filter(f(v1,v2) & v1 != v2)

But how can the result just be a 3 x 3 dataframe? I expected something like my drawing below, where each number would be the correlation of an x and a y, with the redundant cells left blank.

   1     2   3    4    5    6     7
1       12   13  14   15   16    17
2            23  24   25   26    27
3                34   35   36    37
4                     45   46    47
5                          56    57
6                                67
7
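A likely explanation for the three rows, assuming the column names shown in the `structure()` output above: `f(v1, v2)` is `grepl(v1, v2)`, so the filter keeps only pairs where the first column name occurs inside the second. Only three pairs of names in this dataset happen to match that way. The sketch below checks this on the names alone, with no correlations involved.

```r
# Column names from the dataset, without INDEX
cols <- c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
          "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO",
          "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP",
          "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB",
          "TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")

# Every ordered pair (v1, v2), as the reshaped correlation table would contain
pairs <- expand.grid(v1 = cols, v2 = cols, stringsAsFactors = FALSE)

# Same condition as the filter(): v1 occurs inside v2, and the names differ
keep <- mapply(grepl, pairs$v1, pairs$v2) & pairs$v1 != pairs$v2
kept <- pairs[keep, ]
kept
```

Dropping the `f(v1, v2)` condition keeps every pair; filtering on `v1 < v2` instead is one way to keep each unordered pair exactly once.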

Is this the kind of matrix you expect?

df <- structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L, 
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

# install.packages("corrr")
library(corrr)
df1 <- corrr::correlate(df, method = "pearson")

# 1. Output:
# A tibble: 17 x 18
   term    INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
   <chr>   <dbl>       <dbl>          <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
 1 INDEX NA          0.642           -0.820         -0.291           0.0236          0.0826           0.205
 2 TARG~  0.642     NA               -0.685          0.367          -0.373           0.673            0.788
 3 TEAM~ -0.820     -0.685           NA              0.192           0.496          -0.449           -0.502
 4 TEAM~ -0.291      0.367            0.192         NA              -0.0789          0.640            0.653
 5 TEAM~  0.0236    -0.373            0.496         -0.0789         NA              -0.752           -0.676
 6 TEAM~  0.0826     0.673           -0.449          0.640          -0.752          NA                0.984
 7 TEAM~  0.205      0.788           -0.502          0.653          -0.676           0.984           NA    
 8 TEAM~  0.134      0.401           -0.560          0.377          -0.754           0.864            0.799
 9 TEAM~  0.790     -0.00267         -0.690         -0.356           0.413          -0.528           -0.541
10 TEAM~  0.874     -0.0332          -0.834         -0.598           0.261          -0.578           -0.623
11 TEAM~ NA         NA               NA             NA              NA              NA               NA    
12 TEAM~ -0.662     -0.923            0.733         -0.358           0.448          -0.771           -0.852
13 TEAM~ -0.352      0.308           -0.127          0.661          -0.767           0.891            0.809
14 TEAM~ -0.914     -0.793            0.736          0.0225          0.0863         -0.341           -0.464
15 TEAM~ -0.667     -0.930            0.719         -0.360           0.424          -0.757           -0.842
16 TEAM~ -0.707     -0.925            0.757         -0.314           0.418          -0.733           -0.820
17 TEAM~  0.0666     0.265           -0.144         -0.583          -0.447          -0.123           -0.150
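If the triangular layout sketched in the question is the goal, a base-R sketch follows. It assumes a handful of the columns from `df` above (re-typed so the block runs on its own), uses `use = "pairwise.complete.obs"` so that `NA` values do not turn whole rows of the result into `NA`, and blanks the diagonal and lower triangle. The 0.8 cutoff at the end is an arbitrary illustration, not a recommendation.

```r
# A few columns from the sample data above
d <- data.frame(
  TARGET_WINS     = c(39, 70, 86, 70, 82, 75),
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279),
  TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200),
  TEAM_BATTING_HR = c(13, 190, 137, 96, 102, 92),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107)
)

# All pairwise Pearson correlations; each pair uses the rows where both are non-NA
m <- cor(d, use = "pairwise.complete.obs")

# Keep each pair once: blank the diagonal and everything below it
upper <- m
upper[lower.tri(upper, diag = TRUE)] <- NA
round(upper, 2)

# Long format: one row per pair, strongest correlations first
idx <- which(upper.tri(m), arr.ind = TRUE)
pairs <- data.frame(v1  = rownames(m)[idx[, "row"]],
                    v2  = colnames(m)[idx[, "col"]],
                    cor = m[idx])
pairs <- pairs[order(-abs(pairs$cor)), ]
pairs

# Pairs whose absolute correlation exceeds the illustrative cutoff
pairs[abs(pairs$cor) > 0.8, ]
```

The long table is often easier to scan than the matrix when deciding which predictors overlap, since the most strongly correlated pairs sit at the top.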

Quick answer to a side question buried in this post: a more efficient way to find columns with NA values, instead of going one by one:

library(dplyr)
moneyball_training_data %>% summarise(across(everything(), ~ any(is.na(.x))))
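A base-R equivalent, for comparison, is to apply `anyNA()` to each column; `colSums(is.na(...))` additionally gives the number of missing values per column. The small data frame here is a stand-in for `moneyball_training_data`.

```r
# Stand-in data with three of the columns
d <- data.frame(
  TARGET_WINS      = c(39, 70, 86),
  TEAM_BASERUN_SB  = c(NA, 37, 46),
  TEAM_BATTING_HBP = c(NA, NA, NA)
)

# TRUE/FALSE per column: does it contain any NA?
has_na <- sapply(d, anyNA)
has_na

# Count of NA values per column
n_na <- colSums(is.na(d))
n_na

# Just the names of the columns that contain NA
names(d)[has_na]
```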
