
Recoding a large number of variables using another data frame in R

I'd like to use a data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame that contains text like local/international rather than 1s/2s (Df3). Missingness is present in the Df1 data frame, and I'd like to make sure it's represented as NA.

This is a minimal working example. The actual data set contains more than a hundred variables (all of class character), each with between one and fifteen levels. Any help would be much appreciated.

Starting point (dfs)

Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))

Desired outcome (df)

Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))

Thoughts so far, not really code: if a row of Df2's NameOfVariable matches a Df1 variable name, and the same row's VariableLevel matches a Df1 observation, then paste that row's VariableDef into Df1. I'm wondering whether if statements could be used for this.

if (Df2["NameOfVariable"]==names(Df1))
{
  if (Df2["VariableLevel"]==Df1[ ])
  {
   Df1[ ] <- paste0("VariableDef") 
  }
}
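
The idea described above can be sketched as an explicit loop over the rows of Df2. This is only a minimal illustration of the matching logic, not code from the original post: the name recoded is made up for the example, and stringsAsFactors = FALSE is set so that the labels are plain character values. Note that the label "NA" is written as text, exactly as in the desired Df3.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

# 'recoded' (hypothetical name) starts as an all-NA character copy of Df1,
# so any code without a definition in Df2 stays NA
recoded <- Df1
recoded[] <- lapply(Df1, function(x) rep(NA_character_, length(x)))

for (i in seq_len(nrow(Df2))) {
  varn <- Df2$NameOfVariable[i]
  if (varn %in% names(Df1)) {
    # rows of Df1 whose code equals this row's VariableLevel
    hit <- which(Df1[[varn]] == Df2$VariableLevel[i])
    recoded[[varn]][hit] <- Df2$VariableDef[i]
  }
}

recoded
```

The loop visits each definition once, so it scales with the number of rows in Df2 rather than with the number of cells in Df1.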

Here is one method in base R using match and Map. Map applies a function to corresponding list elements. Here, there are two lists: Df1 (a data frame is a list of columns) and a list built from the second and third columns of Df2, split by column 1. The second list is reordered to match the order of the names in Df1.

The applied function matches the elements of a Df1 column against the VariableLevel values of the corresponding piece of Df2 and uses the result as an index to return the corresponding VariableDef values. Map returns a list, which is converted to a data.frame with the function of the same name.

data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
               Df1,
               split(Df2[2:3], Df2[1])[names(Df1)]))

This returns

  buyer_Q1 seller_Q2 price_Q1_2
1    local  internat    50-100K
2 internat     local   100-200K
3    local        NA      200+K
4    local  internat   100-200K
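
To make the one-liner easier to follow, here is a sketch that unpacks it for a single column and then handles missingness. The names lookup, idx and res are made up for this illustration, and stringsAsFactors = FALSE is an assumption so that the labels are character strings. The NA printed in the output above is the text label "NA" from Df2, not a real missing value, so the last step converts it.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

# One small lookup table per variable, reordered to follow the columns of Df1
lookup <- split(Df2[2:3], Df2[[1]])[names(Df1)]

# For one column: position of each code within that variable's levels ...
idx <- match(Df1$seller_Q2, lookup$seller_Q2$VariableLevel)
# ... which then indexes the labels
lookup$seller_Q2$VariableDef[idx]

# The full recode, followed by turning the text "NA" into a real NA
res <- data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])], Df1, lookup),
                  stringsAsFactors = FALSE)
res[] <- lapply(res, function(x) replace(x, x %in% "NA", NA))

is.na(res$seller_Q2)
```

Codes in Df1 that have no entry in Df2 also come back as NA, because match returns NA when it finds nothing.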

A solution using a loop and factors. Be careful: the printed results look the same, but the objects are not identical. The function fun returns a data frame whose columns are factors. If needed, you can convert them to character.

Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))

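fun <- function(df, mdf) {
  for (varn in names(df)) {
    # definitions for this variable, dropping rows with a missing label
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
    # codes become factor levels, definitions become their labels
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  return(df)
}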

fun(Df1, Df2)
Df3
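
As a sketch of the conversion mentioned above, the factor columns can be turned into character columns with lapply and as.character. The names res and res_chr are made up for this illustration, the data and fun are repeated so the snippet runs on its own, and stringsAsFactors = FALSE is an assumption.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

fun <- function(df, mdf) {
  for (varn in names(df)) {
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef), ]
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  df
}

res <- fun(Df1, Df2)
sapply(res, class)        # every column is a factor

# Replace each factor column by its labels as plain character strings
res_chr <- res
res_chr[] <- lapply(res, as.character)
sapply(res_chr, class)
```

Keeping the factors can still be useful, since the levels preserve the order in which the definitions appear in Df2.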

A solution using dplyr and tidyr. The code still works if it emits warning messages, which appear when the columns are factors. If you don't want to see any warnings, set stringsAsFactors = FALSE when creating the data frames, as in the DATA section below.

library(dplyr)
library(tidyr)

Df3 <- Df1 %>%
  mutate(ID = 1:n()) %>%
  gather(NameOfVariable, VariableLevel, -ID) %>%
  left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
  select(-VariableLevel) %>%
  spread(NameOfVariable, VariableDef) %>%
  select(-ID)

Df3
  buyer_Q1 price_Q1_2 seller_Q2
1    local    50-100K  internat
2 internat   100-200K     local
3    local      200+K        NA
4    local   100-200K  internat

DATA

Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
                  "seller_Q2"=c(2,1,3,2),
                  "price_Q1_2"=c(2,5,7,5),
                  stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
                  "VariableLevel"=c(1,2,1,2,3,2,5,7),
                  "VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
                  stringsAsFactors = FALSE)
