
R: creating a comprehensive table of correlations between combinations of columns

Here is a look at my dataset. I'm looking at baseball data.

structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L, 
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

I'm trying to build a multiple linear regression and decide which predictor variables to include. The problem is that I think some of these variables are going to be strongly correlated with each other. For example, one of the columns is "base hits by batters (any kind of hit)" and another is "doubles by batters", and so on. So I think if a player hits a double, it would add 1 to several different columns.

I'm trying to figure out which variables to include and one strategy I have in mind is deciding which of these variables are correlated with each other and how strongly they are correlated. Maybe variables that are really strongly correlated with each other I won't include. (Help on this?)
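One way to act on that idea, sketched in base R under some assumptions: compute the full correlation matrix once and list the pairs whose absolute correlation exceeds a cutoff. The data frame `small` holds four columns copied from the sample above, and the 0.9 cutoff is an arbitrary illustration, not a recommended rule.

```r
# Four columns copied from the sample data (hypothetical helper data frame)
small <- data.frame(
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279),
  TEAM_BATTING_2B = c(194, 219, 232, 209, 186, 200),
  TEAM_BATTING_HR = c(13, 190, 137, 96, 102, 92),
  TEAM_BATTING_BB = c(143, 685, 602, 451, 472, 443)
)

cutoff <- 0.9      # arbitrary threshold, chosen only for illustration
m <- cor(small)    # all pairwise Pearson correlations in one call

# Look only above the diagonal so each pair is reported once
hits <- which(abs(m) > cutoff & upper.tri(m), arr.ind = TRUE)

flagged <- data.frame(
  v1  = rownames(m)[hits[, "row"]],
  v2  = colnames(m)[hits[, "col"]],
  cor = m[hits]
)
print(flagged)
```

On the full dataset, columns with missing values would probably need something like use = "pairwise.complete.obs" in cor(); otherwise their correlations come back as NA.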

I started down this road, looking at Pearson correlations one by one:

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_2B)

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_3B)

cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_HR)

But then I saw how many combinations there are between all of these variables. There are 16 columns in this dataframe and I want to select any two: 16! / (2! (16 - 2)!). If my math is right, that would be 120 lines of code with this method, and it would be easy to get tangled and lose track of which ones I've already done... So not very efficient.
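As a quick check of that count, a minimal base R sketch: choose() gives the number of unordered pairs and combn() enumerates them. The four names in `vars` are just a subset of the dataset's columns, picked for illustration.

```r
# Number of unordered pairs among 16 columns: 16! / (2! * 14!)
n_pairs <- choose(16, 2)
print(n_pairs)

# Enumerate the pairs for a small illustrative subset of the columns
vars <- c("TEAM_BATTING_H", "TEAM_BATTING_2B",
          "TEAM_BATTING_3B", "TEAM_BATTING_HR")
pair_list <- combn(vars, 2)   # 2 x k matrix, one column per pair
print(t(pair_list))           # one pair per row
```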

So my original question was: Is there any efficient coding method to compare the comprehensive set of correlations between variables in a dataframe?

I then found this amazing post on Stack Overflow that I think answers my question but I still can't quite get it to work.

Side note - I also tried to figure out which columns had NA values in case NA values here made a difference.

any(is.na(moneyball_training_data$TARGET_WINS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_2B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_3B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
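# These two columns are named TEAM_BASERUN_*, not TEAM_BATTING_*. With the wrong
# name, $ returns NULL and any(is.na(NULL)) is FALSE. Row 1 of the sample is NA in both.
any(is.na(moneyball_training_data$TEAM_BASERUN_SB)) # TRUE
any(is.na(moneyball_training_data$TEAM_BASERUN_CS)) # TRUE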
any(is.na(moneyball_training_data$TEAM_BATTING_HBP)) # TRUE
any(is.na(moneyball_training_data$TEAM_PITCHING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_FIELDING_E)) # FALSE
any(is.na(moneyball_training_data$TEAM_FIELDING_DP)) # TRUE

(Side note: is there a more efficient way to write this any(is.na()) code?)

To continue, I now follow the direction of the other Stack Overflow answer, the tidy method, which I don't fully understand but the guy who gave the answer seemed smart:

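library(dplyr)   # %>%, select(), mutate(), filter()
library(tidyr)   # gather()

# function to use later (to filter out rows)
# grepl(x, y) is TRUE when the pattern x is found inside the string y
f = function(x,y) grepl(x,y)
f = Vectorize(f)

moneyball_training_data %>% 
  select(-INDEX) %>%                # remove unnecessary columns
  cor() %>%                      # get all correlations (even ones you don't care about)
  data.frame() %>%               # save result as a dataframe
  mutate(v1 = row.names(.)) %>%  # add row names as a column
  gather(v2,cor, -v1) %>%        # reshape data
  filter(f(v1,v2) & v1 != v2)    # keep rows where v1's name occurs inside v2's name (and v1 != v2)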

[image: enter image description here]

But how can the result just be a 3 x 3 dataframe? I expected something like my drawing below, where each number would be a correlation of an x and y with voided spaces for redundancies removed.

   1     2   3    4    5    6     7
1       12   13  14   15   16    17
2            23  24   25   26    27
3                34   35   36    37
4                     45   46    47
5                          56    57
6                                67
7
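A likely explanation for the 3 x 3 result, offered as a sketch rather than a definitive diagnosis: the final filter() keeps only rows where the name in v1 appears inside the name in v2. That suited the original answer's data, but among these 16 column names only three pairs satisfy it, and the three columns are v1, v2 and cor. The check below reproduces the name matching in base R; `vars` is typed out from the dput() output above.

```r
# Column names after dropping INDEX
vars <- c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B",
          "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_BATTING_BB",
          "TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS",
          "TEAM_BATTING_HBP", "TEAM_PITCHING_H", "TEAM_PITCHING_HR",
          "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E",
          "TEAM_FIELDING_DP")

# Every ordered (v1, v2) combination, as gather() would produce
pairs <- expand.grid(v1 = vars, v2 = vars, stringsAsFactors = FALSE)

# Same condition as filter(f(v1, v2) & v1 != v2)
keep <- mapply(grepl, pairs$v1, pairs$v2) & pairs$v1 != pairs$v2
kept <- pairs[keep, ]
print(kept)
```

Dropping the f(v1, v2) condition and keeping only v1 != v2 would return every pair, each listed twice (once in each order).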

Is this the kind of matrix you expect?

df <- structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L, 
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

# install.packages("corrr")
library(corrr)
df1 <- corrr::correlate(df, method = "pearson")

# 1. Output:
# A tibble: 17 x 18
   term    INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
   <chr>   <dbl>       <dbl>          <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
 1 INDEX NA          0.642           -0.820         -0.291           0.0236          0.0826           0.205
 2 TARG~  0.642     NA               -0.685          0.367          -0.373           0.673            0.788
 3 TEAM~ -0.820     -0.685           NA              0.192           0.496          -0.449           -0.502
 4 TEAM~ -0.291      0.367            0.192         NA              -0.0789          0.640            0.653
 5 TEAM~  0.0236    -0.373            0.496         -0.0789         NA              -0.752           -0.676
 6 TEAM~  0.0826     0.673           -0.449          0.640          -0.752          NA                0.984
 7 TEAM~  0.205      0.788           -0.502          0.653          -0.676           0.984           NA    
 8 TEAM~  0.134      0.401           -0.560          0.377          -0.754           0.864            0.799
 9 TEAM~  0.790     -0.00267         -0.690         -0.356           0.413          -0.528           -0.541
10 TEAM~  0.874     -0.0332          -0.834         -0.598           0.261          -0.578           -0.623
11 TEAM~ NA         NA               NA             NA              NA              NA               NA    
12 TEAM~ -0.662     -0.923            0.733         -0.358           0.448          -0.771           -0.852
13 TEAM~ -0.352      0.308           -0.127          0.661          -0.767           0.891            0.809
14 TEAM~ -0.914     -0.793            0.736          0.0225          0.0863         -0.341           -0.464
15 TEAM~ -0.667     -0.930            0.719         -0.360           0.424          -0.757           -0.842
16 TEAM~ -0.707     -0.925            0.757         -0.314           0.418          -0.733           -0.820
17 TEAM~  0.0666     0.265           -0.144         -0.583          -0.447          -0.123           -0.150
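If the goal is the triangular layout from the question's drawing, a base R sketch along these lines may help. It assumes pairwise-complete observations are acceptable for the columns with NA values, and it uses a hypothetical four-column data frame `small` copied from the sample data to keep the example short.

```r
# Four columns copied from the sample data; TEAM_BASERUN_SB contains an NA
small <- data.frame(
  TARGET_WINS     = c(39, 70, 86, 70, 82, 75),
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279),
  TEAM_BATTING_HR = c(13, 190, 137, 96, 102, 92),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107)
)

# Full symmetric matrix; each pair uses the rows where both values are present
full <- cor(small, use = "pairwise.complete.obs")

# Blank out the diagonal and the redundant lower triangle
upper <- full
upper[lower.tri(upper, diag = TRUE)] <- NA
print(round(upper, 3), na.print = "")

# Long format: one row per unique pair, strongest correlations first
idx <- which(upper.tri(full), arr.ind = TRUE)
long <- data.frame(
  v1  = rownames(full)[idx[, "row"]],
  v2  = colnames(full)[idx[, "col"]],
  cor = full[idx]
)
long <- long[order(-abs(long$cor)), ]
print(long)
```

The corrr package also provides shave() and stretch() for the same two reshaping steps on the output of correlate().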

Quick answer to the side question buried in this post: a more efficient way to find the columns that contain NA values, instead of checking them one by one:

moneyball_training_data %>% summarise(across(everything(), ~ any(is.na(.x))))
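A base R alternative, as a minimal sketch that needs no packages. The data frame `small` is a hypothetical three-column subset of the sample data, used only to keep the example self-contained.

```r
# Three columns copied from the sample data
small <- data.frame(
  TARGET_WINS      = c(39L, 70L, 86L, 70L, 82L, 75L),
  TEAM_BASERUN_SB  = c(NA, 37L, 46L, 43L, 49L, 107L),
  TEAM_BATTING_HBP = rep(NA_integer_, 6)
)

# Number of missing values in each column
na_counts <- colSums(is.na(small))
print(na_counts)

# Names of the columns that contain at least one NA
cols_with_na <- names(small)[sapply(small, anyNA)]
print(cols_with_na)
```

The counts are often more useful than a plain TRUE/FALSE here, since a column that is entirely NA (like TEAM_BATTING_HBP in the sample) cannot be correlated with anything.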
