Here is a look at my dataset. I'm looking at baseball data.
structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L,
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L,
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
I'm trying to create a multiple linear regression and decide which predictor variables to include. The problem is, some of these variables I think are going to be really correlated with each other. For example, one of the columns is "base hits by batters (any kind of hit)" and another column is "doubles by batters" and so on. So I think if a player scores a double it would check +1 in multiple different columns.
I'm trying to figure out which variables to include and one strategy I have in mind is deciding which of these variables are correlated with each other and how strongly they are correlated. Maybe variables that are really strongly correlated with each other I won't include. (Help on this?)
I started down this road, looking at pearson correlation one-by-one:
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_2B)
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_3B)
cor(moneyball_training_data$TEAM_BATTING_H, moneyball_training_data$TEAM_BATTING_HR)
But then I saw how many permutations there are between all of these variables: There are 16 columns in this dataframe and I want to select any two, 16, / (2. (16 - 2).) If my math is right. this would be 120 lines of code by doing it this method. and it would be easy to get tangled and lose track of which ones I've already done... So not very efficient.
So my original question was: Is there any efficient coding method to compare the comprehensive set of correlations between variables in a dataframe?
I then found this amazing post on Stack Overflow that I think answers my question but I still can't quite get it to work.
Side note - I also tried to figure out which columns had NA values in case NA values here made a difference.
any(is.na(moneyball_training_data$TARGET_WINS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_2B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_3B)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
any(is.na(moneyball_training_data$TEAM_BATTING_SB)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_CS)) # FALSE
any(is.na(moneyball_training_data$TEAM_BATTING_HBP)) # TRUE
any(is.na(moneyball_training_data$TEAM_PITCHING_H)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_HR)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_BB)) # FALSE
any(is.na(moneyball_training_data$TEAM_PITCHING_SO))# TRUE
any(is.na(moneyball_training_data$TEAM_FIELDING_E)) # FALSE
any(is.na(moneyball_training_data$TEAM_FIELDING_DP)) # TRUE
(side note - is there a more efficient way to do this an(is.na)) code?)
To continue, I now follow the direction of the other Stack Overflow answer, the tidy method, which I don't fully understand but the guy who gave the answer seemed smart:
# function to use later (to filter out rows)
f = function(x,y) grepl(x,y)
f = Vectorize(f)
moneyball_training_data %>%
select(-INDEX) %>% # remove unnecessary columns
cor() %>% # get all correlations (even ones you don't care about)
data.frame() %>% # save result as a dataframe
mutate(v1 = row.names(.)) %>% # add row names as a column
gather(v2,cor, -v1) %>% # reshape data
filter(f(v1,v2) & v1 != v2)
But how can the result just be a 3 x 3 dataframe? I expected something like my drawing below, where each number would be a correlation of an x and y with voided spaces for redundancies removed.
1 2 3 4 5 6 7
1 12 13 14 15 16 17
2 23 24 25 26 27
3 34 35 36 37
4 45 46 47
5 56 57
6 67
7
Do you expect such kind of matrix?
df <- structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842L, 1075L, 917L,
922L, 920L, 973L), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L,
107L), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
# install.packages("corrr")
library(corrr)
df1 <- corrr::correlate(df, method = "pearson")
# 1. Output:
# A tibble: 17 x 18
term INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 INDEX NA 0.642 -0.820 -0.291 0.0236 0.0826 0.205
2 TARG~ 0.642 NA -0.685 0.367 -0.373 0.673 0.788
3 TEAM~ -0.820 -0.685 NA 0.192 0.496 -0.449 -0.502
4 TEAM~ -0.291 0.367 0.192 NA -0.0789 0.640 0.653
5 TEAM~ 0.0236 -0.373 0.496 -0.0789 NA -0.752 -0.676
6 TEAM~ 0.0826 0.673 -0.449 0.640 -0.752 NA 0.984
7 TEAM~ 0.205 0.788 -0.502 0.653 -0.676 0.984 NA
8 TEAM~ 0.134 0.401 -0.560 0.377 -0.754 0.864 0.799
9 TEAM~ 0.790 -0.00267 -0.690 -0.356 0.413 -0.528 -0.541
10 TEAM~ 0.874 -0.0332 -0.834 -0.598 0.261 -0.578 -0.623
11 TEAM~ NA NA NA NA NA NA NA
12 TEAM~ -0.662 -0.923 0.733 -0.358 0.448 -0.771 -0.852
13 TEAM~ -0.352 0.308 -0.127 0.661 -0.767 0.891 0.809
14 TEAM~ -0.914 -0.793 0.736 0.0225 0.0863 -0.341 -0.464
15 TEAM~ -0.667 -0.930 0.719 -0.360 0.424 -0.757 -0.842
16 TEAM~ -0.707 -0.925 0.757 -0.314 0.418 -0.733 -0.820
17 TEAM~ 0.0666 0.265 -0.144 -0.583 -0.447 -0.123 -0.150
Quick answer to a side question buried in this post: more efficient way to find columns with NA values in them instead of going one-by-one
moneyball_training_data %>% summarise(across(, ~ any(is.na(.x))))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.