Pass column names as function arguments - R
I am trying to find the mean and median of x across the categories "a" and "b" of the y variable, and I am trying to write a function to do this calculation. This is the sample dataset:
sample_data <- data.frame(x = 1:10, y = c("a","b"))
library(data.table)
sample_data_dt <- as.data.table(sample_data)
I have tried the following methods, but I have not found an elegant/simple way to pass column names as function parameters that works for both a data.table and a data.frame.
One working script for the data.table sample_data_dt is:
apply_statistics_4 <- function(df, on_col, by_col){
df[, list(mean_value = mean(get(on_col)), median_value = median(get(on_col))), by = get(by_col)]}
apply_statistics_4(sample_data_dt, "x", "y") #works
However, a similar script does not work for a data.frame with the ddply function:
library(plyr)
apply_statistics_5 <- function(df, on_col, by_col){
ddply(df,.(get(by_col)), summarize, mean1 = mean(get(on_col)), median1 = median(get(on_col)))}
apply_statistics_5(sample_data, "x", "y") #Does not work
# Error in get(by_col) : object 'y' not found
One working script that I found for a data.frame using the ddply function is:
apply_statistics <- function(df, on_col, by_col){
df$y1 <- eval(substitute(by_col), df)
df$x1 <- eval(substitute(on_col), df)
ddply(df,.(y1), summarize, mean1 = mean(x1), median1 = median(x1))}
d <- apply_statistics(sample_data, x, y) #Works
If you know of any other method to use column names as function parameters in R for both a data.table and a data.frame, please share it with an explanation.
Thanks.
You can reference the column names as follows:
sample_data[["y"]]
sample_data_dt[["y"]]
Another command that works similarly (although not identically) for both types is subset, e.g.:
on_col <- "x"
subset(sample_data, select=get(on_col))
subset(sample_data_dt, select=get(on_col))
by_col <- "y"
subset(sample_data, subset=get(by_col)=="a")
subset(sample_data_dt, subset=get(by_col)=="a")
Note that the row numbers are output differently by data.table's version of subset and the base R version, but otherwise they are pretty much interchangeable (although data.table is of course much faster).
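Building on the `[[` lookup above, here is a minimal sketch of a grouped mean/median function that takes both column names as strings. It uses only base R (`tapply`), not plyr or data.table. The function name `apply_statistics_base` and the output column names are hypothetical, chosen for illustration.

```r
# Sample data from the question (y is recycled to length 10)
sample_data <- data.frame(x = 1:10, y = c("a", "b"))

# Hypothetical helper: on_col and by_col are column names given as strings
apply_statistics_base <- function(df, on_col, by_col) {
  values <- df[[on_col]]   # string-based column lookup
  groups <- df[[by_col]]
  means   <- tapply(values, groups, mean)    # one entry per group
  medians <- tapply(values, groups, median)
  data.frame(
    group        = names(means),
    mean_value   = as.vector(means),
    median_value = as.vector(medians)
  )
}

result <- apply_statistics_base(sample_data, "x", "y")
print(result)
```

Because the function only touches the input through `[[`, it should accept a data.table as well as a data.frame, though the result is always a plain data.frame.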
It doesn't seem to be a ddply problem but something related to the function environment. I ran some tests: if you define the variables in the global environment, ddply can find them and return the result, but something curious happens when you pass the string as a variable to the function.
m <- "x"
n <- "y"
apply_statistics_5 <- function(df, m, n){
ddply(df, n, summarise, mean1 = mean(get(m)), median1 = median(get(m)))
}
apply_statistics_5(sample_data, "x", "y")
y mean1 median1
1 a 5 5
2 b 6 6
This will not work if m and n don't exist in the global environment.
Update: It might have something to do with the scoping issue of the plyr package mentioned here.
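If the failure really does come from how get() resolves names inside ddply, one way to sidestep it is to avoid get() altogether and build the grouping call from the column-name strings. The sketch below does this in base R with `reformulate()` and `aggregate()`. This is a swapped-in technique, not plyr and not the code from this answer, and `apply_statistics_formula` is a hypothetical name.

```r
# Sample data from the question
sample_data <- data.frame(x = 1:10, y = c("a", "b"))

# Hypothetical helper: builds the formula on_col ~ by_col from strings,
# so nothing depends on variables living in the global environment
apply_statistics_formula <- function(df, on_col, by_col) {
  f <- reformulate(by_col, response = on_col)   # e.g. x ~ y
  means   <- aggregate(f, data = df, FUN = mean)
  medians <- aggregate(f, data = df, FUN = median)
  # Rename the value column in each result before joining on the group column
  names(means)[names(means) == on_col]     <- "mean1"
  names(medians)[names(medians) == on_col] <- "median1"
  merge(means, medians, by = by_col)
}

result2 <- apply_statistics_formula(sample_data, "x", "y")
print(result2)
```

The column names are used only as strings and resolved against df, so the function behaves the same whether or not objects named m, n, x or y exist in the calling environment.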