I have the following covariance matrix in R:
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
Each row and column corresponds to a football player (e.g., AB-2000). The diagonal entry (AB-2000, AB-2000) gives the variance of that player's performance, and an off-diagonal entry such as (AB-2000, AF-0200) gives the covariance between two players' performances.
Currently, the matrix shows all covariance values. However, not all of them matter. The only ones that matter are those where the two players are playing in the same game that week (that is, they have the same game ID, GID).
The following table shows the GID for each PLAYER in a given week:
GID PLAYER
3467 AB-2000
3460 AB-2600
3463 AB-3500
3467 AC-0100
3458 AD-0100
3461 AF-0200
How do I go about keeping only the values in the covariance matrix when the two players have the same GID (for instance, players AB-2000 and AC-0100)?
Thanks for the help!
I think this does what you're asking, if I'm interpreting the question correctly. I've given you a couple of solutions, so pick your poison. The first relies on a for loop over the rows, which could be slow for a large matrix and could be optimized further if you knew for sure that your matrix was symmetric.
m <- read.table(header=T, stringsAsFactors=F, text="
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
")
p <- read.table(header=T, stringsAsFactors=F, text="
GID PLAYER
3467 AB-2000
3460 AB-2600
3463 AB-3500
3467 AC-0100
3458 AD-0100
3461 AF-0200
")
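## Work on a copy of m. read.table() mangles the column names to "AB.2000" etc.,
## so restore them from the row names:
m_t2 <- m
names(m_t2) <- row.names(m_t2)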
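## Look up the GID for each row player and each column player.
## match() does not depend on p being in the same order as the matrix:
row_names <- p$GID[match(row.names(m_t2), p$PLAYER)]
col_names <- p$GID[match(names(m_t2), p$PLAYER)]
## In each row, blank out every column whose player is in a different game:
for (i in seq_len(nrow(m_t2))) {
  m_t2[i, col_names != row_names[i]] <- NA
}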
m_t2 <- as.matrix(m_t2)
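The loop above can also be written without iteration. The following is a minimal base R sketch, not part of the original answer: it assumes the same data as in the question, reads it with `check.names = FALSE` so the hyphenated player names survive, and uses `outer()` to build a logical mask that is TRUE where the row and column players share a GID. The names `gid_row`, `gid_col`, `same_game` and `m_masked` are made up for this illustration.

```r
# Covariance matrix from the question; check.names = FALSE keeps "AB-2000" as-is
m <- as.matrix(read.table(header = TRUE, check.names = FALSE, text = "
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
"))

# Player-to-game lookup table from the question
p <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
GID PLAYER
3467 AB-2000
3460 AB-2600
3463 AB-3500
3467 AC-0100
3458 AD-0100
3461 AF-0200
")

# GID of the player on each row and on each column
gid_row <- p$GID[match(rownames(m), p$PLAYER)]
gid_col <- p$GID[match(colnames(m), p$PLAYER)]

# TRUE where the row player and the column player are in the same game
same_game <- outer(gid_row, gid_col, "==")

# Keep only the same-game entries (the diagonal is always kept)
m_masked <- m
m_masked[!same_game] <- NA
m_masked
```

Because the mask is built for the whole matrix at once, this version does not rely on the matrix being symmetric, and it avoids the data frame round trip.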
Alternatively, this solution relies on the tidyr and dplyr packages, but it should be quite efficient for very large datasets:
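## m is the data frame read in above; restore the hyphenated player names
## and keep the row names as a regular column:
names(m) <- row.names(m)
m$row_names <- row.names(m)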
library(tidyr)
library(dplyr)
d <- m %>%
  gather(col_names, "cv", -row_names, convert = TRUE) %>%
  left_join(p, by = c("row_names" = "PLAYER")) %>%
  mutate(GID_row = GID) %>%
  select(-GID) %>%
  left_join(p, by = c("col_names" = "PLAYER")) %>%
  mutate(GID_col = GID) %>%
  mutate(new_cv = ifelse((GID_row == GID_col), cv, NA)) %>%
  select(row_names, col_names, new_cv) %>%
  spread(col_names, new_cv)
m_t <- as.matrix(d[,-1])
row.names(m_t) <- d[["row_names"]]
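In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The sketch below shows how the same reshape, join and mask steps might look with the newer functions. It is an illustration under assumptions rather than the original answer's code: it assumes tidyr >= 1.0.0 and dplyr are installed, and the names `row_player`, `col_player`, `long` and `wide` are invented here.

```r
library(dplyr)
library(tidyr)

# Same data as in the question; check.names = FALSE keeps the hyphenated names
m <- read.table(header = TRUE, check.names = FALSE, text = "
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.5 NA -1.8 3.65 -17.96 -26.5
AB-2600 NA 7.18 NA NA NA NA
AB-3500 -1.79 NA 5.4 NA -4.63 NA
AC-0100 3.65 NA NA 4.22 9.8 NA
AD-0100 -17.96 NA -4.63 9.8 5.9 NA
AF-0200 -26.5 NA NA NA NA 4.28
")
p <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
GID PLAYER
3467 AB-2000
3460 AB-2600
3463 AB-3500
3467 AC-0100
3458 AD-0100
3461 AF-0200
")

long <- m %>%
  mutate(row_player = rownames(m)) %>%
  # One row per (row player, column player) pair
  pivot_longer(-row_player, names_to = "col_player", values_to = "cv") %>%
  # Attach the GID of the row player, then of the column player
  left_join(rename(p, GID_row = GID), by = c("row_player" = "PLAYER")) %>%
  left_join(rename(p, GID_col = GID), by = c("col_player" = "PLAYER")) %>%
  # Keep the covariance only when both players are in the same game
  mutate(cv = ifelse(GID_row == GID_col, cv, NA_real_)) %>%
  select(row_player, col_player, cv)

# Back to a square layout, then to a matrix with player names on both axes
wide <- pivot_wider(long, names_from = col_player, values_from = cv)
m_t <- as.matrix(wide[, -1])
rownames(m_t) <- wide$row_player
m_t
```

Renaming `GID` inside each join avoids the intermediate `mutate()` and `select()` steps, which is a stylistic choice rather than a change in behavior.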
The solution in either case looks like this:
> m_t
AB-2000 AB-2600 AB-3500 AC-0100 AD-0100 AF-0200
AB-2000 6.50 NA NA 3.65 NA NA
AB-2600 NA 7.18 NA NA NA NA
AB-3500 NA NA 5.4 NA NA NA
AC-0100 3.65 NA NA 4.22 NA NA
AD-0100 NA NA NA NA 5.9 NA
AF-0200 NA NA NA NA NA 4.28