简体   繁体   中英

Using dplyr to mutate factor variable conditional on numeric variables

I have data from a sports team tracking wins/losses versus other teams with the following structure:

Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss ...
1    1         0           1        NA        NA       NA
2    1         1           NA       NA        NA       1
3    2         1           NA       NA        1        NA
4    2         2           NA       1         NA       NA
5    3         2           NA       NA        1        NA
...

I want to create a factor variable that includes the team the game was played against so that the data looks like this:

Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss Team
1    1         0           1        NA        NA       NA        Team1
2    1         1           NA       NA        NA       1         Team2
3    2         1           NA       NA        1        NA        Team2
4    2         2           NA       1         NA       NA        Team1
5    3         2           NA       NA        1        NA        Team2
...

My thinking (NOT workable code) is essentially this:

if (Team1Win == 1 | Team1Loss == 1), Team = "Team1"
if (Team2Win == 1 | Team2Loss == 1), Team = "Team2"

I'm really struggling with how to do this in dplyr using mutate. I've tried various approaches with ifelse, recode, etc., but I either keep getting errors or results that are not what I want.

What is correct and most efficient way to make this work in dplyr?

Along similar lines to other answers, but with some useful changes:

  • na.rm = TRUE in gather
  • sub works just fine, no need for stringr
  • includes the full data as per the goal, using a full_join .
library(dplyr)
library(tidyr)

df = read.delim(text = 
"Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA", sep = " ")

df %>% 
  select(-starts_with("Total")) %>% 
  gather(Team, one, -Game, na.rm = TRUE) %>% 
  select(-one) %>% 
  mutate(Team = sub("Win|Loss", "", Team)) %>% 
  full_join(df, .)

#> Joining, by = "Game"
#>   Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss  Team
#> 1    1         1           0        1        NA       NA        NA Team1
#> 2    2         1           1       NA        NA       NA         1 Team2
#> 3    3         2           1       NA        NA        1        NA Team2
#> 4    4         2           2       NA         1       NA        NA Team1
#> 5    5         3           2       NA        NA        1        NA Team2

I'm a sucker for dplyr approaches to things at the moment so I offer a solution using dplyr that extends to however many teams you might have. It does also use tidyr and stringr as helpfully pointed out in the comment by apom below.

library(dplyr)
library(tidyr)
library(stringr)

df = read_delim(
"Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA",delim = " ")

df %>% 
  gather("Team",value,contains("Team")) %>% 
  filter(!is.na(value)) %>%
  mutate(Team = str_replace_all(Team,c("Win" = "","Loss" = ""))) %>%
  select(-value)

this could be what you are looking for. ( not hardcoded for just 2 teams)

# solution 1 :
paste0("Team",ceiling(apply(df[-c(1:3)], 1, function(x) which(!is.na(x)))/2))
[1] "Team1" "Team2" "Team2" "Team1" "Team2"

# solution 2: using a apply() {basically a for loop itself]
apply(df[-c(1:3)], 1, function(x) gsub("(Team\\d)+.*", "\\1", colnames(df[-c(1:3)])[which(!is.na(x))]))
# [1] "Team1" "Team2" "Team2" "Team1" "Team2"

# solution 3: (long route to dplyr) [ you have indirectly taught me a lot in dplyr through my search for this solution]
func <- function(x){
  y = which(x == 1)             # get the location of where 1 appears
  z = rep(0, times = length(x)) # create a vector of 0's+location of 1
  z[y] = y                      # i.e. c(0,0,3,0,5) for Team2Win 
  z
}

df1 = df[-c(1:3)] %>% gather("key", "value", starts_with("Team")) %>% 
                group_by(key) %>%
                dplyr::mutate(x = func(value)) %>%
                filter(x != 0) %>% arrange(x) %>% select(key)

df$newcol = gsub("(Team\\d+).*", "\\1", df1$key)

You can do it using a simple loop:

x = colnames(df)
df$team<- NA

for (i in 1:nrow(df))
{
 df$team[i] = x[which(df[i,]==1)]
}

And then at the end you can trim "win" and "loss" using following function:

df$team<- gsub("win", "",df$team)
df$team<- gsub("loss", "",df$team)

I'm pretty sure that you have more than two teams in your data, and team names are not generic. What you want to do is to reshape the data to long form first and extract the relevant team name. So you may want to proceed as follows.

library(dplyr)
library(tidyr)
new_df <- df %>% 
  gather(team,idx,Team1Win:Team100Loss) %>%
  filter(!is.na(idx)) %>%
  select(-idx) %>%
  mutate(team = gsub("Win|Loss","",team))

If you want to keep those wide columns, then you can join the new DF to the old one.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM