
R: Recoding multiple dummy variables into a single variable and replacing the corresponding dummy value with the variable name

I have a dataset with 14 mutually exclusive categories of call type all coded as dummy variables. Here is a small sample:

dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), 
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("MON1_12", 
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", 
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "data.frame", row.names = c(NA, 
-10L))

I want to combine the dummy variables into a single new variable called "QUEUE" that replaces each value of 1 with the name of its corresponding dummy variable. Here is an example of what this would look like:

dput(df2)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), QUEUE = structure(c(1L, 4L, 2L, 4L, 1L, 3L, 
3L, 5L, 5L, 4L), .Label = c("CLAIMS", "CONTENT", "CREDIT_CARD", 
"DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12", 
"WEEK1_53", "AGENT_ID", "CallsHandled", "QUEUE"), class = "data.frame", row.names = c(NA, 
-10L))

Edit in response to the question being downvoted: this is what I tried this afternoon, on a recommendation, with a slightly different sample data frame:

df$Queue <- as.factor(df$CONTENT + df$CLAIMS*2 + df$CREDIT_CARD*3 +  df$DEDUCT_BILL*4 + df$HCREFORM*5)
levels(df$Queue) <- c("CONTENT", "CLAIMS", "CREDIT_CARD","DEDUCT_BILL","HCREFORM")
View(df)

But I received a column of NAs in the Queue column. So I recreated another sample dataset here. This data frame is adequately representative of what I'll receive in reality, except that I'll have about 40 variables and 2 million rows. When I run the code above on the "df" shown earlier, I get the following incorrect result:

dput(df)
structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), CONTENT = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), CLAIMS = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
CREDIT_CARD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
DEDUCT_BILL = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), 
HCREFORM = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Queue = structure(c(2L, 
1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("CONTENT", 
"CLAIMS", "CREDIT_CARD", "DEDUCT_BILL", "HCREFORM"), class = "factor")), .Names = c("MON1_12", 
"WEEK1_53", "AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", 
"CREDIT_CARD", "DEDUCT_BILL", "HCREFORM", "Queue"), row.names = c(NA, 
-10L), class = "data.frame")

I also tried:

df3 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))

but received the following error: "Error in data.frame("CLAIMS", character(0), character(0), "DEDUCT_BILL", : arguments imply differing number of rows: 1, 0"

This should produce the desired result:

df2 <- cbind(df[1:4], QUEUE = apply(df[5:9], 1, function(N) names(N)[as.logical(N)]))

provided that exactly one of the dummy variables is 1 in every row (which is not true of your original sample df).

Explanation: df[1:4] selects columns one through four, which are preserved in the output. cbind then binds them column-wise to QUEUE. QUEUE is obtained by applying a function to each row of the dummy variables (columns five through nine) of df and selecting the name of the column whose value is 1.
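The row-wise selection described above fails when a row has no dummy set to 1, because that row yields character(0) and apply then returns a list instead of a vector. A minimal sketch of one way to guard against this, assuming NA is an acceptable value for such rows; the data frame toy and the vector dummy_cols are hypothetical names used only for illustration:

```r
# Toy data frame (hypothetical); row 2 has no dummy set to 1
toy <- data.frame(
  id      = c("a", "b", "c"),
  CONTENT = c(0L, 0L, 1L),
  CLAIMS  = c(1L, 0L, 0L)
)
dummy_cols <- c("CONTENT", "CLAIMS")

# vapply() always returns a character vector of length nrow(toy),
# whereas apply() falls back to a list when some rows yield character(0)
toy$QUEUE <- vapply(seq_len(nrow(toy)), function(i) {
  hit <- dummy_cols[as.logical(unlist(toy[i, dummy_cols]))]
  # keep the name only when exactly one dummy is set; otherwise mark as NA
  if (length(hit) == 1L) hit else NA_character_
}, character(1))

toy$QUEUE
```

Looping over rows like this is likely to be slow on a data set with millions of rows, so it is probably better suited to checking a small sample than to the full data.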

You could use max.col to get the index of the column that has a value of 1 in each row, for columns 5 to 9. (The 'df' example is not correct, as most of its rows are all 0s; a corrected one is given below.)

df$QUEUE <-  names(df)[-c(1:4)][max.col(df[-c(1:4)])]
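One caveat: by default max.col breaks ties at random (ties.method = "random"), so a row that is all 0s gets an arbitrary column name. A small sketch, assuming such rows should become NA instead; the data frame toy is made up for illustration:

```r
# Toy dummy columns (hypothetical); row 2 is all zeros
toy <- data.frame(
  CONTENT  = c(0, 0, 1),
  CLAIMS   = c(1, 0, 0),
  HCREFORM = c(0, 0, 0)
)
m <- as.matrix(toy)

# ties.method = "first" makes the result deterministic
idx <- max.col(m, ties.method = "first")

# rows that do not contain exactly one 1 are not valid one-hot rows
idx[rowSums(m) != 1] <- NA

queue <- colnames(m)[idx]
queue
```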

Or you can do

df$QUEUE <-  names(df)[-(1:4)][(as.matrix(df[-(1:4)]) %*% 
                         seq_along(df[-(1:4)]))[,1]]
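The matrix product works because multiplying a one-hot row by the vector 1, 2, ..., k leaves only the position of the single 1. A tiny illustration on a made-up matrix m (not the data from the question):

```r
# Each row has exactly one 1 (hypothetical example)
m <- rbind(c(0, 1, 0),
           c(0, 0, 1),
           c(1, 0, 0))
colnames(m) <- c("CONTENT", "CLAIMS", "CREDIT_CARD")

# Row i times 1:3 equals the position of the 1 in row i
pos <- (m %*% seq_len(ncol(m)))[, 1]
pos

# Use the positions to look up the column names
colnames(m)[pos]
```

Note that an all-zero row would give a position of 0, and indexing with 0 silently drops that element, so the result would come out shorter than the number of rows.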

Update

Based on the edited dataset 'df', some rows are all 0s in columns 5:9, and the expected result shows 'QUEUE' as 'CONTENT' for those rows. In that case, we can first set the 'CONTENT' column to 1 in the rows that are all 0s, and then apply either of the approaches above.

df$CONTENT[!rowSums(df[5:9])] <- 1
df$QUEUE1 <- names(df)[5:9][max.col(df[5:9])]
df$QUEUE1
#[1] "CLAIMS"      "CONTENT"     "CONTENT"     "DEDUCT_BILL" "CONTENT"
#[6] "CONTENT"     "CONTENT"     "CONTENT"     "CONTENT"     "CONTENT"
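The expected output in the question stores QUEUE as a factor and drops the dummy columns. A sketch of that last step, assuming every row has exactly one dummy set to 1; toy, toy2 and dummy_cols are illustrative names, not part of the original data:

```r
# Toy data (hypothetical) with two id columns and three dummy columns
toy <- data.frame(
  AGENT_ID     = c("A129", "A360", "A407"),
  CallsHandled = c(1L, 4L, 2L),
  CONTENT      = c(0, 0, 1),
  CLAIMS       = c(1, 0, 0),
  DEDUCT_BILL  = c(0, 1, 0)
)
dummy_cols <- c("CONTENT", "CLAIMS", "DEDUCT_BILL")

# Name of the column holding the 1 in each row
queue <- dummy_cols[max.col(toy[dummy_cols], ties.method = "first")]

# Keep the non-dummy columns and append QUEUE as a factor
toy2 <- cbind(toy[setdiff(names(toy), dummy_cols)],
              QUEUE = factor(queue, levels = dummy_cols))
str(toy2)
```

Fixing the factor levels to dummy_cols keeps categories that happen not to occur in the data, which may or may not be what you want.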

data

df <- structure(list(MON1_12 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), WEEK1_53 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
AGENT_ID = structure(c(3L, 
4L, 7L, 8L, 1L, 6L, 5L, 9L, 2L, 10L), .Label = c("A129", "A360", 
"A407", "B891", "D197", "L145", "L722", "O518", "T443", "W764"
), class = "factor"), CallsHandled = c(1L, 4L, 2L, 14L, 1L, 2L, 
5L, 1L, 1L, 3L), CONTENT = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0), CLAIMS = c(1, 
0, 0, 0, 1, 0, 0, 0, 0, 0), CREDIT_CARD = c(0, 0, 0, 0, 0, 1, 
1, 0, 0, 0), DEDUCT_BILL = c(0, 1, 0, 1, 0, 0, 0, 0, 0, 1),
 HCREFORM = c(0, 
0, 0, 0, 0, 0, 0, 1, 1, 0)), .Names = c("MON1_12", "WEEK1_53", 
"AGENT_ID", "CallsHandled", "CONTENT", "CLAIMS", "CREDIT_CARD", 
"DEDUCT_BILL", "HCREFORM"), row.names = c(NA, -10L), class = "data.frame")
