Pass a data.frame column name to a function
I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.
The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant substitute() and possibly eval().
fun1 <- function(x, column){
  do.call("max", list(substitute(x[a], list(a = column))))
}
fun2 <- function(x, column){
  max(eval((substitute(x[a], list(a = column)))))
}
df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")
I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:
- Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
- with(x, get(column)), but, even if it works, I think this would still require substitute.
- formula() and match.call(), neither of which I have much experience with.
Subquestion: Is do.call() preferred over eval()?
You can just use the column name directly:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))
There's no need to use substitute, eval, etc.
You can even pass the desired function as a parameter:
fun1 <- function(x, column, fn) {
  fn(x[,column])
}
fun1(df, "B", max)
Alternatively, using [[ also works for selecting a single column at a time:
df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[[column]])
}
fun1(df, "B")
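The question also asks whether the function could accept either a column name or a column number. A minimal sketch of that, relying on the fact that [[ takes both a string and an integer position (col_max is a hypothetical name used only for illustration):

```r
# Hypothetical helper; `column` may be a name (string) or a position (integer)
col_max <- function(x, column) {
  max(x[[column]])
}

df <- data.frame(A = 1:10, B = 2:11, C = 3:12)
by_name  <- col_max(df, "B")  # select by column name
by_index <- col_max(df, 2L)   # select by column position
```

Both calls select the same column, so the caller can use whichever form is more convenient.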
This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.
Suppose we have a very simple data frame:
dat <- data.frame(x = 1:4,
                  y = 5:8)
and we'd like to write a function that creates a new column z that is the sum of columns x and y.
A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:
foo <- function(df,col_name,col1,col2){
  df$col_name <- df$col1 + df$col2
  df
}
#Call foo() like this:
foo(dat,z,x,y)
The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".
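The difference can be seen directly at the console. This is a small sketch using the same dat as above; the variable col1 holding the string "x" is introduced here only for illustration:

```r
dat <- data.frame(x = 1:4, y = 5:8)
col1 <- "x"

# `$` does not evaluate col1; it looks for a column literally named "col1"
via_dollar <- dat$col1

# `[[` evaluates col1 first and then looks up the column named "x"
via_brackets <- dat[[col1]]
```

Here via_dollar is NULL because no column is called col1, while via_brackets holds the values of column x.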
The simplest, and most often recommended, solution is simply to switch from $ to [[ and pass the function arguments as strings:
new_column1 <- function(df,col_name,col1,col2){
  #Create new column col_name as sum of col1 and col2
  df[[col_name]] <- df[[col1]] + df[[col2]]
  df
}
> new_column1(dat,"z","x","y")
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12
This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.
The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. This section of Hadley's Advanced R book is an excellent reference for some of these issues.
If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):
new_column2 <- function(df,col_name,col1,col2){
  col_name <- deparse(substitute(col_name))
  col1 <- deparse(substitute(col1))
  col2 <- deparse(substitute(col2))
  df[[col_name]] <- df[[col1]] + df[[col2]]
  df
}
> new_column2(dat,z,x,y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12
This is, frankly, probably a bit silly, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.
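One of the subtle points of failure with deparse(substitute()) shows up as soon as the function is called from inside another function. The following sketch uses two hypothetical helpers, name_of and wrapper, to show what actually gets captured:

```r
# Hypothetical helper: returns the text of whatever was typed for `arg`
name_of <- function(arg) deparse(substitute(arg))

# Hypothetical wrapper that forwards its own argument to name_of()
wrapper <- function(col) name_of(col)

direct  <- name_of(x)  # captures the symbol typed by the caller
wrapped <- wrapper(x)  # captures the wrapper's argument name instead
```

The direct call yields "x", but the wrapped call yields "col", because substitute() only sees the expression written in the immediate call. Passing strings avoids this surprise.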
Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:
new_column3 <- function(df,col_name,expr){
  col_name <- deparse(substitute(col_name))
  df[[col_name]] <- eval(substitute(expr),df,parent.frame())
  df
}
Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:
> new_column3(dat,z,x+y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
  x y  z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
  x y  z
1 1 5  5
2 2 6 12
3 3 7 21
4 4 8 32
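The third argument to eval(), parent.frame(), is what lets the expression mix columns of df with variables that exist only where the function was called. A small sketch under that reading, where scale_x and factor are hypothetical names added for illustration:

```r
new_column3 <- function(df, col_name, expr) {
  col_name <- deparse(substitute(col_name))
  # Look up names in df first, then in the caller's environment
  df[[col_name]] <- eval(substitute(expr), df, parent.frame())
  df
}

dat <- data.frame(x = 1:4, y = 5:8)

# Hypothetical caller: `factor` is local to scale_x, `x` is a column of d
scale_x <- function(d, factor) new_column3(d, z, x * factor)
res <- scale_x(dat, 10)
```

Here x is found in the data frame and factor is found in the calling function, so column z of res is x multiplied by 10.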
So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.
Personally I think that passing the column as a string is pretty ugly. I like to do something like:
get.max <- function(column,data=NULL){
  column<-eval(substitute(column),data, parent.frame())
  max(column)
}
which will yield:
> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5
Notice how the specification of a data.frame is optional. You can even work with functions of your columns:
> get.max(1/mpg,mtcars)
[1] 0.09615385
Another way is to use the tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or bare column names. See more about tidyeval here.
library(rlang)
library(tidyverse)
set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))
Use column names as strings
fun3 <- function(x, ...) {
  # capture strings and create variables
  dots <- ensyms(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun3(df, "B")
#> B
#> 1 1.715065
fun3(df, "B", "D")
#> B D
#> 1 1.715065 1.786913
Use bare column names
fun4 <- function(x, ...) {
  # capture expressions and create quosures
  dots <- enquos(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}
fun4(df, B)
#> B
#> 1 1.715065
fun4(df, B, D)
#> B D
#> 1 1.715065 1.786913
#>
Created on 2019-03-01 by the reprex package (v0.2.1.9000)
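Since this answer was written, summarise_at() has been superseded in dplyr by across(). The following is a rough equivalent of fun3 and fun4, assuming dplyr 1.0.0 or later; max_by_string, max_by_name and the scores data frame are hypothetical names used only for this sketch:

```r
library(dplyr)

# Column names supplied as a character vector
max_by_string <- function(x, cols) {
  summarise(x, across(all_of(cols), ~ max(.x, na.rm = TRUE)))
}

# Bare column names supplied as a tidyselect expression, e.g. c(B, D)
max_by_name <- function(x, cols) {
  summarise(x, across({{ cols }}, ~ max(.x, na.rm = TRUE)))
}

# Small deterministic data set so the result is easy to check by eye
scores <- data.frame(B = c(1, 5, 3), D = c(2, NA, 4))
res_string <- max_by_string(scores, c("B", "D"))
res_bare   <- max_by_name(scores, c(B, D))
```

Both calls return a one-row data frame with the maximum of each selected column, ignoring the missing value in D.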
With dplyr it's now also possible to access a specific column of a dataframe by simply using double curly braces {{...}} around the desired column name within the function body, e.g. for col_name:
library(tidyverse)
fun <- function(df, col_name){
  df %>%
    filter({{col_name}} == "test_string")
}
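A short usage sketch for the function above, loading only dplyr rather than the whole tidyverse. The function is renamed filter_test here, and the data frame d with its label column is made up for illustration:

```r
library(dplyr)

filter_test <- function(df, col_name) {
  df %>%
    filter({{ col_name }} == "test_string")
}

# Hypothetical example data: two of the three rows match
d <- data.frame(id = 1:3,
                label = c("test_string", "other", "test_string"))
matched <- filter_test(d, label)  # bare column name, no quotes
```

The result keeps only the rows whose label equals "test_string".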
As an extra thought, if you need to pass the column name unquoted to the custom function, perhaps match.call() could be useful as well in this case, as an alternative to deparse(substitute()):
df <- data.frame(A = 1:10, B = 2:11)
fun <- function(x, column){
  arg <- match.call()
  max(x[[arg$column]])
}
fun(df, A)
#> [1] 10
fun(df, B)
#> [1] 11
If there is a typo in the column name, it would be safer to stop with an error:
fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf
# Stop with error in case of typo
fun <- function(x, column){
  arg <- match.call()
  if (is.null(x[[arg$column]])) stop("Wrong column name")
  max(x[[arg$column]])
}
fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10
Created on 2019-01-11 by the reprex package (v0.2.1)
I do not think I would use this approach, since it involves more typing and complexity than just passing the quoted column name as pointed out in the above answers, but it is an approach.
Tung's answer and mgrund's answer presented tidy evaluation. In this answer I'll show how we can use these concepts to do something similar to joran's answer (specifically his function new_column3). The objective is to make it easier to see the differences between base evaluation and tidy evaluation, and also to see the different syntaxes that can be used in tidy evaluation. You will need rlang and dplyr for this.
Using base evaluation tools (joran's answer):
new_column3 <- function(df,col_name,expr){
  col_name <- deparse(substitute(col_name))
  df[[col_name]] <- eval(substitute(expr),df,parent.frame())
  df
}
In the first line, substitute captures col_name unevaluated, as an expression, more specifically a symbol (also sometimes called a name), not an object. rlang's replacements for substitute can be:
ensym - turns it into a symbol;
enexpr - turns it into an expression;
enquo - turns it into a quosure, an expression that also points to the environment where R should look for the variables to evaluate it.
Most of the time, you want to have that pointer to the environment. When you don't specifically need it, having it rarely causes problems. Thus, most of the time you can use enquo. In this case, you can use ensym to make the code easier to read, as it makes it clearer what col_name is.
Also in the first line, deparse is turning the expression/symbol into a string. You could also use as.character or rlang::as_string.
In the second line, the substitute is turning expr into a 'full' expression (not a symbol), so ensym is not an option anymore.
Also in the second line, we can now change eval to rlang::eval_tidy. eval would still work with enexpr, but not with a quosure. When you have a quosure, you don't need to pass the environment to the evaluation function (as joran did with parent.frame()).
One combination of the substitutions suggested above might be:
new_column3 <- function(df,col_name,expr){
  col_name <- as_string(ensym(col_name))
  df[[col_name]] <- eval_tidy(enquo(expr), df)
  df
}
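To check that the quosure really carries its environment along, the rlang version can be called with an expression that mixes columns with a variable from the calling environment. This sketch repeats the function with library(rlang) loaded; dat and the variable k are made up for illustration:

```r
library(rlang)

new_column3 <- function(df, col_name, expr) {
  col_name <- as_string(ensym(col_name))
  # No parent.frame() needed: the quosure remembers where expr came from
  df[[col_name]] <- eval_tidy(enquo(expr), df)
  df
}

dat <- data.frame(x = 1:4, y = 5:8)
k <- 2  # lives in the calling environment, not in dat
res <- new_column3(dat, z, (x + y) * k)
```

The columns x and y are found in dat, while k is found through the environment stored in the quosure.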
We can also use the dplyr operators, which allow for data-masking (evaluating a column in a data frame as a variable, calling it by its name). Instead of converting the symbol to a character string and subsetting df with [[, we can use mutate:
new_column3 <- function(df,col_name,expr){
  col_name <- ensym(col_name)
  df %>% mutate(!!col_name := eval_tidy(enquo(expr), df))
}
To prevent the new column from being named "col_name", we force its early evaluation (unquote it), as opposed to R's default lazy evaluation, with the bang-bang !! operator. Because we performed an operation on the left-hand side, we can't use the 'normal' =, and must use the new syntax :=.
The common operation of turning a column name into a symbol and then unquoting it with bang-bang has a shortcut: the curly-curly {{ operator:
new_column3 <- function(df,col_name,expr){
  df %>% mutate({{col_name}} := eval_tidy(enquo(expr), df))
}
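As a possible further simplification (a sketch, not part of the original answers), curly-curly can also be applied to expr itself, which removes the explicit eval_tidy(enquo()) call. The name new_column4 is hypothetical, and this assumes a dplyr version that supports {{ }}:

```r
library(dplyr)

# Hypothetical variant: both the new column name and the expression
# are forwarded to mutate() with curly-curly
new_column4 <- function(df, col_name, expr) {
  df %>% mutate({{ col_name }} := {{ expr }})
}

dat <- data.frame(x = 1:4, y = 5:8)
res <- new_column4(dat, z, x + y)
```

Here mutate() does the data-masked evaluation itself, so x + y is computed from the columns of dat.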
I'm not an expert in evaluation in R and might have oversimplified or used a wrong term, so please correct me in the comments. I hope to have helped in comparing the different tools used in the answers to this question.
If you are trying to build this function within an R package or simply want to reduce complexity, you can do the following:
test_func <- function(df, column) {
  if (column %in% colnames(df)) {
    # with=FALSE is a data.table argument, so df is assumed to be a data.table
    return(max(df[, column, with=FALSE]))
  } else {
    # Build the message with stop() directly; cat() would print it and return NULL
    stop(column, " not in data.frame columns.")
  }
}
The argument with=FALSE "disables the ability to refer to columns as if they are variables, thereby restoring the "data.frame mode"" (per CRAN documentation). Note that with is an argument of data.table's [ method, so df has to be a data.table here. The if statement is a quick way to catch whether the column name provided is within the data.frame. You could also use tryCatch error handling here.
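For a plain base data.frame, the same validation idea can be sketched without with=FALSE by selecting the column with [[. The function name col_max_checked and the error wording are assumptions made for this example:

```r
# Hypothetical base-R variant of the check above
col_max_checked <- function(df, column) {
  if (!column %in% colnames(df)) {
    stop("'", column, "' not in data.frame columns.")
  }
  max(df[[column]])
}

df <- data.frame(A = 1:10, B = 2:11)
ok <- col_max_checked(df, "B")

# Capture the error message instead of aborting, to show what the caller sees
msg <- tryCatch(col_max_checked(df, "typo"),
                error = function(e) conditionMessage(e))
```

A valid name returns the column maximum, while a misspelled name stops with a message that names the offending column.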
This is great, but it is not working on datetime columns for some reason. It gives me this error: Error in Ops.POSIXt(dataset[[col_name_x]], z) : '*' not defined for "POSIXt" objects. Any suggestions?