Apply a function across groups and columns in data.table and/or dplyr

I would like to combine two data.tables or data frames with unequal numbers of rows, where the number of rows of dt2 equals the number of groups in dt1. Here is a reproducible example:

library(data.table)
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)

The real case involves a large number of columns, so I wish to refer to them with variables. Using either a loop or apply, I wish to add each row of dt2 to the rows comprising each group of dt1. Here is one of many attempts that fail:

for (ic in 1:3) {
  c1 <- dt2[,(ic), with=FALSE]
  c2 <- dt2[,(ic), with=FALSE]
  dt1[,(ic) := .(c1 + c2[.G]), by = "groupVar"]
}

I am interested in how to do this kind of operation "by group and by column" in both data.table syntax and dplyr syntax. Updating in place (as above) is not critical.

Desired result:

dt1 (or dt3) = 
a   b   c   groupVar
11  22  33  1
12  23  34  1
13  24  35  1
24  35  46  2 
...
40  51  62  3

Assuming that the column names are consistent (e.g., you want a + a2, b + b2, etc.), here is a tidyverse solution that starts in a similar way to @dclarson's, then uses the bang-bang operator to select the columns to add up.

Is this what you are after?

library(tidyverse)  # loads dplyr, tibble and purrr

## Create tibbles and join on the group key
dt1 <- tibble(groupVar,a,b,c)
dt2 <- tibble(groupVar = 1:3,a2,b2,c2)
dt3 <- inner_join(dt1,dt2, by = "groupVar")

## Define the column starters you are interested in
cols <- c("a","b","c")
## Or in case of many columns
cols <- colnames(dt1[-1])

## Create function to add columns with the same starting letters
add_cols <- function(col){
  dt3 %>% select(starts_with(!!col)) %>% 
    transmute(!!(sym(col)) :=  !!(sym(col)) +  !!(sym(paste0(col,"2")))) 
}
## map the function and add groupVar
map_dfc(cols,add_cols) %>% mutate(groupVar = dt3$groupVar)

    # A tibble: 10 x 4
       a     b     c groupVar
   <dbl> <dbl> <dbl>    <dbl>
 1    11    22    33        1
 2    12    23    34        1
 3    13    24    35        1
 4    24    35    46        2
 5    25    36    47        2
 6    26    37    48        2
 7    37    48    59        3
 8    38    49    60        3
 9    39    50    61        3
10    40    51    62        3

The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.

Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:

  1. Working in long format
  2. EDIT: Update joins using get()
  3. EDIT 2: Computing on the language

1. Working in long format

The information on corresponding columns can be provided in a look-up table or translation table:

library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))

lut
   vars1 vars2
1:     a    a2
2:     b    b2
3:     c    c2

In cases where column names are treated as data and the column data are of the same data type, my first approach is to reshape to long format.

# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables 
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
    groupVar  a  b  c
 1:        1 11 22 33
 2:        1 12 23 34
 3:        1 13 24 35
 4:        2 24 35 46
 5:        2 25 36 47
 6:        2 26 37 48
 7:        3 37 48 59
 8:        3 38 49 60
 9:        3 39 50 61
10:        3 40 51 62

2. Update joins using get()

On second thought, here is an approach which is similar to the OP's proposed for loop and requires much less coding:

vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
   
for (iv in seq_along(vars1)) {
  dt1[dt2, on = .(groupVar), 
      (vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}

dt1[]
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

Note that dt1 is updated by reference, i.e., without copying.

Prefixing the variable names vars1[iv] with "x." and vars2[iv] with "i." on the right-hand side of := ensures that the correct columns from dt1 and dt2, respectively, are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").

3. Computing on the language

This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.
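To make the role of the prefixes concrete, here is a minimal sketch with two hypothetical tables, x_dt and i_dt (these names and values are not part of the question), that both contain a column called val. Without the prefixes it would be ambiguous which val is meant inside the update join.

```r
library(data.table)

# Hypothetical tables that both have a column named "val"
x_dt <- data.table(id = c(1L, 1L, 2L), val = c(1, 2, 3))
i_dt <- data.table(id = 1:2, val = c(10, 20))

# x.val refers to the column of x_dt (the table being updated),
# i.val refers to the column of i_dt (the table joined in)
x_dt[i_dt, on = .(id), val := x.val + i.val]

x_dt$val
```

Each row of x_dt receives its own val plus the val of the matching row in i_dt, which is the same pattern the loop above applies to one pair of columns at a time.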

library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>% 
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>% 
  glue_collapse( sep = ", ") %>% 
  {glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>% 
  EVAL()
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

It starts with a look-up table which is created on the fly and subsequently manipulated to form a complete data.table statement

dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]

as a character string. This string is then evaluated and executed in one go; no for loops are required.

As the helper function EVAL() already uses paste0(), the call to glue() can be omitted:

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>% 
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>% 
  glue_collapse( sep = ", ") %>% 
  {EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}

Note that the dot . and curly brackets {} are used with different meanings in different contexts, which may appear somewhat confusing.

It is simple if you add groupVar to dt2:

dt2 <- data.table(a2, b2, c2, groupVar=1:3)
dt3 <- merge(dt1, dt2)
dt4 <- with(dt3, data.table(a=a+a2, b=b+b2, c=c+c2, groupVar))
dt4
#      a  b  c groupVar
#  1: 11 22 33        1
#  2: 12 23 34        1
#  3: 13 24 35        1
#  4: 24 35 46        2
#  5: 25 36 47        2
#  6: 26 37 48        2
#  7: 37 48 59        3
#  8: 38 49 60        3
#  9: 39 50 61        3
# 10: 40 51 62        3
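Since the question mentions a large number of columns, the same merge-then-add idea can be written without naming each column by hand. The following is only a sketch under that assumption: vars1 and vars2 are hypothetical character vectors that pair each dt1 column with its dt2 partner, and Map() adds the pairs column by column.

```r
library(data.table)

# Sample data from the question, with groupVar added to dt2
dt1 <- data.table(a = 1:10, b = 2:11, c = 3:12,
                  groupVar = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
dt2 <- data.table(a2 = c(10, 20, 30), b2 = c(20, 30, 40), c2 = c(30, 40, 50),
                  groupVar = c(1, 2, 3))

# Hypothetical name vectors: vars1[i] in dt1 is paired with vars2[i] in dt2
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")

# Join on the group key so every dt1 row carries its group's dt2 values
joined <- merge(dt1, dt2, by = "groupVar")

# Add each pair of columns; the result keeps the names of the first argument
sums <- as.data.table(Map(`+`,
                          joined[, vars1, with = FALSE],
                          joined[, vars2, with = FALSE]))
sums[, groupVar := joined$groupVar]
sums
```

Unlike the update joins shown earlier, this builds a new table rather than modifying dt1 by reference.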

This should do what you want:

  1. Create a groupVar in dt2 with the unique groupVar values from dt1
  2. right_join by groupVar
  3. Create new columns a, b, c with mutate
  4. Keep a, b, c and groupVar as the desired output

library(dplyr)

dt3 <- dt2 %>% 
  mutate(groupVar = unique(dt1$groupVar)) %>% 
  right_join(dt1, by="groupVar") %>% 
  mutate(a = a + a2,
         b = b + b2,
         c = c + c2) %>% 
  select(a, b, c, groupVar)
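With many columns, spelling out each sum in mutate becomes tedious. One possible generalisation, offered as a sketch rather than as part of the answer above, uses across() together with cur_column(). It assumes that each partner column in dt2 is named by appending "2" to the dt1 name, and that vars1 (a hypothetical helper vector) lists the dt1 columns to update.

```r
library(dplyr)

# Sample data from the question, with groupVar already added to dt2
dt1 <- tibble(a = 1:10, b = 2:11, c = 3:12,
              groupVar = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
dt2 <- tibble(a2 = c(10, 20, 30), b2 = c(20, 30, 40), c2 = c(30, 40, 50),
              groupVar = c(1, 2, 3))

# Hypothetical vector of dt1 columns; partners are assumed to be "<name>2"
vars1 <- c("a", "b", "c")

dt3 <- dt1 %>%
  left_join(dt2, by = "groupVar") %>%
  # for each column in vars1, look up its partner by name and add it
  mutate(across(all_of(vars1), ~ .x + get(paste0(cur_column(), "2")))) %>%
  select(all_of(vars1), groupVar)

dt3
```

If the column names do not follow a common suffix pattern, a look-up vector mapping each dt1 name to its dt2 partner could replace the paste0() call.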

Data:

library(data.table)
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)
