Summarize a data.table across multiple columns

Question

How do I summarize a data.table with unreliable data across multiple columns?

Specifically, given

library(data.table)

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))
dt
#     user behavior country language event
#  1:    3    FALSE       1        6     1
#  2:    3    FALSE       1        6     2
#  3:    3    FALSE       1        6     3
#  4:    3    FALSE       1        6     4
#  5:    3    FALSE       2        6     5
#  6:    4     TRUE       2        5     7
#  7:    4     TRUE       2        5     8
#  8:    4     TRUE       2        5     9
#  9:    4     TRUE       2        5    10
# 10:    4     TRUE       2        6     6

I want to get

#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

(here x.name is the most common value of x for the user, and x.support is the share of that user's events in which this top value was observed)
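To make the two definitions concrete, here is a small worked example for a single user, using base R only. The vector mirrors the country values of user 3 in the example data; the variable names `counts`, `top` and `support` are illustrative and not part of the original code.

```r
# Country values for the five events of user 3 in the example data.
country <- c(1, 1, 1, 1, 2)

counts  <- table(country)                    # how often each value occurs
top     <- names(counts)[which.max(counts)]  # most common value (as a string): x.name
support <- max(counts) / length(country)     # share of events with that value: x.support
```

Here `top` is "1" and `support` is 0.8, which matches the country.name and country.support columns of the first row in the desired output.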

without having to go through both fields by hand like this:

users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
             ][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
             ][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

The actual number of fields is 5, and I want to avoid repeating the same code for each field separately and having to edit this function whenever I modify fields. Please note that this is the substance of this question; the support computation was kindly explained to me elsewhere.

As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying, as in users <- users[dt.out].

Answer 1

Does this solve your problem?

library(data.table)

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))

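CalculateSupport <- function(dt, field) {
  # Count events per (user, field value).
  x <- dt[, .N, by = c("user", field)]
  # Rename to a fixed column name so it can be used unquoted in j below.
  setnames(x, field, "value")
  # Per user: the most common value and its share of the user's events.
  x <- x[, list(value[which.max(N)], max(N) / sum(N)), by = user]
  setnames(x, c("V1", "V2"), paste0(field, c(".name", ".support")))
  x
}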

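# behavior is TRUE if the user showed the behavior at least once
users <- dt[, sum(behavior) > 0, by=user]
setnames(users, "V1", "behavior")

# Join the per-field summaries onto users, one field at a time
Reduce(function(x, name) x[CalculateSupport(dt, name)], fields, users)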

results in

   user behavior country.name country.support language.name language.support
1:    3    FALSE            1             0.8             6              1.0
2:    4     TRUE            2             1.0             5              0.8
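If the intermediate copies made by each x[...] join are a concern, one possible variant is to add the columns to users by reference with an update join. This is only a sketch under the same example data: it assumes a data.table version that supports the `on=` argument, and the names `counts`, `top` and `new_cols` are hypothetical helpers, not part of the code above.

```r
library(data.table)

fields <- c("country", "language")
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10)

# One row per user; behavior is TRUE if it was observed at least once.
users <- dt[, list(behavior = any(behavior)), by = user]

for (f in fields) {
  # Count events per (user, value of the current field).
  counts <- dt[, .N, by = c("user", f)]
  setnames(counts, f, "value")
  # Per user: the most common value and its share of the user's events.
  top <- counts[, list(name = value[which.max(N)],
                       support = max(N) / sum(N)), by = user]
  # Update join: add the two new columns to `users` in place.
  new_cols <- paste0(f, c(".name", ".support"))
  users[top, on = "user", (new_cols) := list(i.name, i.support)]
}
```

The loop depends only on the contents of fields, so adding a field requires no change to the code. Whether this is measurably faster than the Reduce version on 10^7 rows would need to be benchmarked; the grouping step is likely to dominate either way.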

PS Please, please take Ricardo's comment on your question seriously. SO is full of wonderful people who are willing to help, but you have to treat them nicely and with respect.

Answer 2

I can't do it in one expression, since I am not sure how to reuse a created field in a data.table expression. It's also probably not the most efficient way. Maybe this will make a good starting point, though.

# Find the most common country and language for each user
summ.dt <- dt[, list(behavior.summ = sum(behavior) > 0,
                     country.name  = dt[user == .BY[[1]], .N, by = country][N == max(N), country],
                     language.name = dt[user == .BY[[1]], .N, by = language][N == max(N), language]),
              by = user]

# Get the support of that country and language for each user
summ.dt[, c("country.support", "language.support") := list(
            nrow(dt[user == .BY[[1]] & country == country.name]) / nrow(dt[user == .BY[[1]]]),
            nrow(dt[user == .BY[[1]] & language == language.name]) / nrow(dt[user == .BY[[1]]])),
        by = user]

   user behavior.summ country.name language.name country.support language.support
1:    3         FALSE            1             6             0.8              1.0
2:    4          TRUE            2             5             1.0              0.8
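One way the two steps might be folded into a single grouped call is to compute the top value and its support together in a helper that returns a named list, and to call that helper once per user. This is a sketch rather than a tuned solution: `top_share` and `summarise_user` are hypothetical helper names, and the per-group R function calls may be slow when there are many users.

```r
library(data.table)

fields <- c("country", "language")
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10, key = c("user", fields))

# Most common value of x and the share of elements equal to it.
top_share <- function(x) {
  ux <- unique(x)
  n <- tabulate(match(x, ux))  # count per unique value, in order of appearance
  list(name = ux[which.max(n)], support = max(n) / length(x))
}

# Summary of one user's rows: the behavior flag plus name/support per field.
summarise_user <- function(sd, fields) {
  stats <- sapply(fields, function(f) top_share(sd[[f]]), simplify = FALSE)
  # unlist(recursive = FALSE) flattens to country.name, country.support, ...
  c(list(behavior = any(sd$behavior)), unlist(stats, recursive = FALSE))
}

summ <- dt[, summarise_user(.SD, fields), by = user,
           .SDcols = c("behavior", fields)]
```

Because both values come out of the same helper call, there is no need to refer back to a column created earlier in the same expression. In case of a tie, which.max keeps the first of the tied values, so each user always yields exactly one row.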

Answer 1: score 5, accepted, 2013-04-26 23:40:12

Answer 2: score 1, 2013-04-26 18:17:45
