
Creating new variables with dplyr::mutate() without conflicting names

I am writing a series of functions that use dplyr internally to manipulate data.

There are a number of places where I'd like to add new variables to the data set as I work with it. However, I am not sure how to name these new variables so as to avoid overwriting variables already in the data, given that I don't know what's in the data set being passed.

In base R I can do this:

df <- data.frame(a = 1:5)

df[, ncol(df)+1] <- 6:10

and it will select a name for the newly added variable (here, V2) that doesn't conflict with any existing names. I'd like to do this in dplyr rather than breaking up the consistent application of dplyr to go back to base R.

All the solutions I've thought of so far feel very kludgy, or require a bunch of base-R futzing anyway that isn't any better than just adding the variable in base R:

  1. Rename all the variables so I know what the names are
  2. Pull out the names() vector and use one of many methods to generate a name that isn't in the vector
  3. Error out if the user happens to have my internal variable names in their data (bad-practice Olympics!)
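
For reference, option 2 need not be much code. A minimal sketch, assuming base R's make.unique() is an acceptable way to generate the name; the helper name unused_name() is hypothetical and not part of dplyr:

```r
# Hypothetical helper: return a column name based on `base` that does
# not clash with any existing column name in `df`.
unused_name <- function(df, base = "tmp") {
  # make.unique() appends ".1", ".2", ... to later duplicates only,
  # so the last element is `base` itself when there is no clash.
  candidates <- make.unique(c(names(df), base))
  candidates[length(candidates)]
}

unused_name(data.frame(a = 1:5), "randomorder")            # "randomorder"
unused_name(data.frame(randomorder = 1:5), "randomorder")  # "randomorder.1"
```

This still relies on base R for the name itself, so it only addresses the "kludgy" concern to the extent that the logic is tucked away in one small helper.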

Is there a straightforward way to do this in dplyr? Getting it to work in mutate would be ideal, although I suppose bind_cols or tibble::add_column would also be fine.

Some things I have tried that don't work:

library(dplyr)

df <- data.frame(a = 1:5)

# Gives the new variable a fixed title which might already be in there
df %>% mutate(6:10)
df %>% tibble::add_column(6:10)
df %>% mutate(NULL = 6:10)

# Error
df %>% bind_cols(6:10)
df %>% mutate( = 6:10)
df %>% mutate(!!NULL := 6:10)

# And an example of the kind of function I'm looking at:
# This function returns the original data arranged in a random order
# and also the random variable used to arrange it
arrange_random <- function(df) {
  df <- df %>%
    mutate(randomorder = runif(n())) %>%
    arrange(randomorder)

  return(df)
}

# No naming conflict, no problem!
data <- data.frame(a = 1:5)
arrange_random(data)

# Uh-oh, the original data gets lost!
data <- data.frame(randomorder = 1:5)
arrange_random(data)
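
One way the function above might be made collision-safe, as a sketch rather than a definitive answer. It assumes make.unique() is used to pick the name, and that the installed dplyr supports `!!name :=` injection and the `.data` pronoun; the function name arrange_random_safe is hypothetical:

```r
library(dplyr)

# Hypothetical collision-safe variant of arrange_random()
arrange_random_safe <- function(df) {
  # Pick a name that is not already a column of df
  nms <- make.unique(c(names(df), "randomorder"))
  tmp <- nms[length(nms)]

  df %>%
    mutate(!!tmp := runif(n())) %>%  # create the column under the generated name
    arrange(.data[[tmp]])            # refer to it by string via the .data pronoun
}

# The user's own randomorder column is kept; the helper column gets a suffix
data <- data.frame(randomorder = 1:5)
arrange_random_safe(data)
```

The caller no longer knows the helper column's name in advance, so a real function would probably want to return or document it.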

I am posting this solution for now. This sounds like a case of not knowing one's data very well, so I think one good approach is to include an if-else statement in the function. The logic is that the user chooses an arbitrary new name to append as a suffix to the original variable names, and the function stops with an error if that name already appears in the names of the original data. Otherwise, the function runs and returns the original data plus the newly mutated columns.

library(dplyr)

df <- data.frame(a = 1:5, b = 11:15, c = 21:25)

# Define a function that stops if the suffix already appears in any column name
addnew <- function(data, name = 'newvar') {
  if (any(grepl(name, names(data), ignore.case = TRUE))) {
    stop('Error! Possible duplicate names with your new variable names')
  } else {
    # Replace every column with random draws, then suffix the names
    data1 <- data %>% mutate_all(list(~ runif(n())))
    names(data1) <- paste0(names(data1), '_', name)
    bind_cols(data, data1)
  }
}

addnew(df,'new')

  a  b  c     a_new     b_new     c_new
1 1 11 21 0.2875775 0.0455565 0.9568333
2 2 12 22 0.7883051 0.5281055 0.4533342
3 3 13 23 0.4089769 0.8924190 0.6775706
4 4 14 24 0.8830174 0.5514350 0.5726334
5 5 15 25 0.9404673 0.4566147 0.1029247

# try with new data that should throw an error
df <- data.frame(a_new = 1:5,b=11:15,c=21:25)

addnew(df,'new')
Error in addnew(df, "new") : 
  Error! Possible duplicate names with your new variable names
