
Creating new variables with dplyr::mutate() without conflicting names

I am writing a series of functions that use dplyr internally to manipulate data.

There are a number of places where I'd like to add new variables to the data set as I work with it. However, I am not sure how to name these new variables so as to avoid overwriting variables already in the data, given that I don't know what's in the data set being passed.

In base R I can do this:

df <- data.frame(a = 1:5)

df[, ncol(df)+1] <- 6:10

and it will select a name for the newly-added variable that doesn't conflict with any existing names. I'd like to do this in dplyr rather than breaking up the consistent application of dplyr to go back to base-R.

All the solutions I've thought of so far feel very kludgy, or require the use of a bunch of base-R futzing anyway that isn't any better than just adding the variable in base-R:

  1. Rename all the variables so I know what the names are
  2. Pull out the names() vector and use one of many methods to generate a name that isn't in the vector
  3. Error out if the user happens to have my internal variable names in their data (bad-practice Olympics!)
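
For reference, option 2 can be sketched in a few lines. This is only a minimal illustration, assuming base R's make.unique() is acceptable for building the name; fresh_name() and its base argument are hypothetical names introduced here, not part of dplyr:

```r
# Hypothetical helper: propose a column name that is not yet used in `df`.
fresh_name <- function(df, base = "tmp") {
  # make.unique() appends a numeric suffix to later duplicates, so the
  # last element is guaranteed to differ from every existing name.
  candidates <- make.unique(c(names(df), base))
  candidates[length(candidates)]
}

df <- data.frame(a = 1:5, tmp = 6:10)
fresh_name(df)           # "tmp" is taken, so a suffix is appended
fresh_name(df, "other")  # "other" is free, so it comes back unchanged
```

The helper only reads names(df), so it works the same way on data frames and tibbles.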

Is there a straightforward way to do this in dplyr? Getting it to work in mutate() would be ideal, although I suppose bind_cols() or tibble::add_column() would also be fine.

Some things I have tried that don't work:

df <- data.frame(a = 1:5)

# Gives the new variable a fixed title which might already be in there
df %>% mutate(6:10)
df %>% tibble::add_column(6:10)
df %>% mutate(NULL = 6:10)

# Error
df %>% bind_cols(6:10)
df %>% mutate( = 6:10)
df %>% mutate(!!NULL := 6:10)

# And an example of the kind of function I'm looking at:
# This function returns the original data arranged in a random order
# and also the random variable used to arrange it
arrange_random <- function(df) {
  df <- df %>%
    mutate(randomorder = runif(n())) %>%
    arrange(randomorder)

  return(df)
}

# No naming conflict, no problem!
data <- data.frame(a = 1:5)
arrange_random(data)

# Uh-oh, the original data gets lost!
data <- data.frame(randomorder = 1:5)
arrange_random(data)

I am posting this solution for now. This sounds like a case of not knowing one's data very well, so I think one good approach is to put an if-else check in the function. The user chooses an arbitrary new name, which is appended as a suffix to each original variable name. If that name already appears anywhere in the original column names (the check uses grepl(), so partial matches count, ignoring case), the function stops with an error. Otherwise it runs and returns the original data plus the newly created columns.

df <- data.frame(a = 1:5, b=11:15, c=21:25)

library(dplyr)

# define function with an if-else check to catch any possible duplicates
addnew <- function(data, name = 'newvar') {
  # grepl() matches `name` anywhere inside an existing column name
  if (any(grepl(name, names(data), ignore.case = TRUE))) {
    stop('Error! Possible duplicate names with your new variable names')
  } else {
    # replace every column with random draws, then suffix the names
    data1 <- data %>% mutate_all(list(~ runif(n())))
    names(data1) <- paste0(names(data1), '_', name)
    bind_cols(data, data1)
  }
}

addnew(df,'new')

  a  b  c     a_new     b_new     c_new
1 1 11 21 0.2875775 0.0455565 0.9568333
2 2 12 22 0.7883051 0.5281055 0.4533342
3 3 13 23 0.4089769 0.8924190 0.6775706
4 4 14 24 0.8830174 0.5514350 0.5726334
5 5 15 25 0.9404673 0.4566147 0.1029247

# try with new data that should throw an error
df <- data.frame(a_new = 1:5,b=11:15,c=21:25)

addnew(df,'new')
Error in addnew(df, "new") : 
  Error! Possible duplicate names with your new variable names
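
If stopping with an error is not acceptable, an alternative is to generate the new name programmatically and pass it to mutate() through tidy evaluation. The following is a minimal sketch, not a definitive implementation: it assumes base R's make.unique() for picking the name, and uses `!!nm :=` and `.data[[nm]]` to refer to a column whose name is stored in a string. The base argument and the nm variable are names chosen for illustration.

```r
library(dplyr)

# Variant of arrange_random() from the question that never overwrites
# an existing column, whatever the input data contains.
arrange_random <- function(df, base = "randomorder") {
  # Pick a column name that does not clash with anything already in `df`
  all_names <- make.unique(c(names(df), base))
  nm <- all_names[length(all_names)]

  df %>%
    mutate(!!nm := runif(n())) %>%  # create the column under the stored name
    arrange(.data[[nm]])            # sort by that same column
}

# The case that previously lost the original data
data <- data.frame(randomorder = 1:5)
out <- arrange_random(data)
names(out)
```

Because the name is computed from names(df) on each call, the original randomorder column is kept and the helper column is added alongside it under a suffixed name.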
