
How to specify a column name in ddply via character variable?

I have a tibble/dataframe:

sample_id     condition     state
---------------------------------
sample1       case          val1
sample1       case          val2
sample1       case          val3
sample2       control       val1
sample2       control       val2
sample2       control       val3

The dataframe is generated within a for loop, once per state. Hence, the state column has a different name in every dataframe.

I want to group the data by sample_id and calculate the median of the state column, so that every unique sample_id has a single median value. The output should look like this:

sample_id     condition     state
---------------------------------
sample1       case          median
sample2       control       median

I am trying the command below. It works if I give the name of the column directly, but I am not able to pass the name via the state character variable. I tried ensym(state) and !!ensym(state), but both throw errors.

ddply(dat_state, .(sample_id), summarize,  condition=unique(condition), state_exp=median(ensym(state)))

As camille notes above, this is easier in dplyr. Basic syntax (not yet addressing your question):

my_df %>% 
  group_by(sample_id, condition) %>% 
  summarize(state = median(state))

Note that this syntax will give you one value for every unique sample_id - condition pair. That isn't an issue in your example, since every sample_id has the same condition, but it is something to be aware of.
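To make the grouping behaviour concrete, here is a minimal sketch on toy data. The numeric values are invented for illustration and are not from the question; `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data (values are invented for illustration only)
my_df <- tibble(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  state     = c(1, 2, 6, 10, 20, 60)
)

# One row per unique sample_id - condition pair
res <- my_df %>%
  group_by(sample_id, condition) %>%
  summarize(state = median(state), .groups = "drop")

res
```

Because each sample_id here has exactly one condition, the result has one row per sample_id.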

On to your question... It's not quite clear to me how you're planning to pass the state name to your calculation, but there are a couple of ways you can handle this. One is to use dplyr's "rename" function:

x <- "Massachusetts"
my_df %>% 
  rename(state = all_of(x)) %>%  # all_of() marks x as a variable holding a column name
  group_by(sample_id, condition) %>% 
  summarize(state = median(state))

The (probably more proper) way to do this is to write a function using dplyr's "tidyeval" syntax:

myfunc <- function(df, state_name) {
  df %>% 
    group_by(sample_id, condition) %>% 
    summarize(state = median({{state_name}}))
}

myfunc(my_df, Massachusetts) # Note: Unquoted state name
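If the column name arrives as a character string (as it does when looping over a vector of names), one option is the `.data` pronoun, which accepts a string inside `[[ ]]`. The sketch below is only an illustration: the function name `median_by_sample`, the toy values, and the column name `STAT1` are assumptions made for the example.

```r
library(dplyr)

# Toy data (values invented for illustration); the state column is named "STAT1"
my_df <- tibble(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  STAT1     = c(1, 2, 6, 10, 20, 60)
)

# Hypothetical helper: state_name is a character string, e.g. "STAT1"
median_by_sample <- function(df, state_name) {
  df %>%
    group_by(sample_id, condition) %>%
    summarize(state = median(.data[[state_name]]), .groups = "drop")
}

state <- "STAT1"
res_chr <- median_by_sample(my_df, state)  # Note: quoted name passed via a variable
res_chr
```

Compared with `{{ }}`, which expects an unquoted column name, `.data[[ ]]` fits the case where the name is already stored in a variable.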

Thank you all for putting effort into answering my question. With your suggestions, I have found the solution. Below is the code for what I was trying to achieve: grouping by sample_id and condition, and passing state through a variable.

state_mark <- c("pPCLg2", "STAT1", "STAT5", "AKT")

for(state in state_mark){
    dat_state <- dat_clust_stim[,c("sample_id", "condition", state)]

    # I had to use !!ensym() to convert a character to a symbol.
    dat_med <- group_by(dat_state, sample_id, condition) %>% 
               summarise(med = median(!!ensym(state)))

    dat_med <- ungroup(dat_med)
    x <- dat_med[dat_med$condition == "case", "med"]
    y <- dat_med[dat_med$condition == "control", "med"]
    t_test <- t.test(x$med, y$med)
}
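For anyone who needs to stay with plyr, as in the original question, a possible approach is to give ddply an anonymous function and index the column by name with `[[ ]]`, which avoids non-standard evaluation entirely. This is a sketch on invented toy data, not the method used in the answers above; the values and the column name `STAT1` are assumptions.

```r
# Toy data (values invented for illustration only)
dat_state <- data.frame(
  sample_id = rep(c("sample1", "sample2"), each = 3),
  condition = rep(c("case", "control"), each = 3),
  STAT1     = c(1, 2, 6, 10, 20, 60)
)

state <- "STAT1"  # column name held in a character variable

# Group by sample_id; d is the sub-dataframe for one sample_id
res_plyr <- plyr::ddply(dat_state, "sample_id", function(d) {
  data.frame(
    condition = unique(d$condition),
    state_exp = median(d[[state]])  # string indexing, no ensym() needed
  )
})

res_plyr
```

Calling `plyr::ddply` with the namespace prefix avoids attaching plyr alongside dplyr, since the two packages share several function names.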
