
Summarize a data.table across multiple columns

How do I summarize a data.table with unreliable data across multiple columns?

Specifically, given

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))
dt
#     user behavior country language event
#  1:    3    FALSE       1        6     1
#  2:    3    FALSE       1        6     2
#  3:    3    FALSE       1        6     3
#  4:    3    FALSE       1        6     4
#  5:    3    FALSE       2        6     5
#  6:    4     TRUE       2        5     7
#  7:    4     TRUE       2        5     8
#  8:    4     TRUE       2        5     9
#  9:    4     TRUE       2        5    10
# 10:    4     TRUE       2        6     6

I want to get

#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

(here x.name is the most common x for the user, and x.support is the share of the user's events in which this top x was observed)

without having to go through both fields by hand like this:

users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
             ][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
             ][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

The actual number of fields is 5, and I want to avoid repeating the same code for each field separately and having to edit this function whenever I modify fields. Please note that this is the substance of this question; the support computation was kindly explained to me elsewhere.

As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying, as in users <- users[dt.out].

Does this solve your problem?

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
           behavior=c(rep(FALSE,5),rep(TRUE,5)),
           country=c(rep(1,4),rep(2,6)),
           language=c(rep(6,6),rep(5,4)),
           event=1:10, key=c("user",fields))

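CalculateSupport <- function(dt, name) {
  # count events per (user, value of the field)
  x <- dt[, .N, by = c('user', name)]
  # temporary fixed column name so it can be referenced in j below
  setnames(x, name, 'name')
  # most common value and its share of the user's events
  x <- x[, list(name[which.max(N)], max(N)/sum(N)), by = user]
  setnames(x, c('V1', 'V2'), paste0(name, c(".name", ".support")))
  x
}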

users <- dt[, sum(behavior) > 0, by=user] 
setnames(users, "V1", "behavior")

Reduce(function(x, name) x[CalculateSupport(dt, name)], fields, users)

results in

   user behavior country.name country.support language.name language.support
1:    3    FALSE            1             0.8             6              1.0
2:    4     TRUE            2             1.0             5              0.8
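
On the copying concern raised in the question, here is a rough sketch of a variant that adds the new columns to users by reference with an update join instead of reassigning users after every join. The helper name support_for is made up for this illustration, the data is the toy table from the question (built without a key), and I have not timed it on 10^7 rows, so treat it as a starting point rather than a tuned solution.

```r
library(data.table)

fields <- c("country", "language")
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10)

# Hypothetical helper (name chosen for this sketch): for one field, return
# each user's most common value and the share of events that carry it.
support_for <- function(dt, field) {
  counts <- dt[, .N, by = c("user", field)]
  counts[, support := N / sum(N), by = user]
  # Largest count first within each user; unique() then keeps that row.
  setorderv(counts, c("user", "N"), order = c(1L, -1L))
  top <- unique(counts, by = "user")
  top[, N := NULL]
  setnames(top, c(field, "support"), paste0(field, c(".name", ".support")))
  top
}

users <- dt[, .(behavior = any(behavior)), by = user]
for (f in fields) {
  s <- support_for(dt, f)
  new_cols <- setdiff(names(s), "user")
  # Update join: the new columns are added to `users` in place.
  users[s, (new_cols) := mget(paste0("i.", new_cols)), on = "user"]
}
users
```

The loop only depends on fields, so adding or removing a field should not require touching the helper.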

P.S. Please take Ricardo's comment on your question seriously. SO is full of wonderful people who are willing to help, but you have to treat them nicely and with respect.

I can't do it in one expression, since I am not sure how to reuse a created field in a data.table expression. It's also probably not the most efficient way. Maybe this will make a good starting point, though.

#Find most common country and language for each user
summ.dt<-dt[,list(behavior.summ=sum(behavior)>0,
     country.name=dt[user==.BY[[1]],.N,by=country][N==max(N),country],
     language.name=dt[user==.BY[[1]],.N,by=language][N==max(N),language]),
by=user]

#Get support for each country and language for each user
summ.dt[,c("country.support","language.support"):=list(
     nrow(dt[user==.BY[[1]] & country==country.name])/nrow(dt[user==.BY[[1]]]),
     nrow(dt[user==.BY[[1]] & language==language.name])/nrow(dt[user==.BY[[1]]])
),by=user]

    user behavior.summ country.name language.name country.support language.support
1:    3         FALSE            1             6             0.8              1.0
2:    4          TRUE            2             5             1.0              0.8
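
On reusing an intermediate value inside one expression: j can be a braced block, so a value computed on one line can be used on the next. Below is a minimal sketch of that idea on the question's toy data; the helper top_share is a name made up for this example, and calling an R function per user and per field may well be slow when there are many users, so this illustrates the mechanism rather than a scalable answer.

```r
library(data.table)

fields <- c("country", "language")
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 behavior = c(rep(FALSE, 5), rep(TRUE, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 language = c(rep(6, 6), rep(5, 4)),
                 event = 1:10)

# Hypothetical helper: most common value of x and its share of length(x).
top_share <- function(x) {
  ux <- unique(x)
  n <- tabulate(match(x, ux))  # counts aligned with ux
  list(name = ux[which.max(n)], support = max(n) / length(x))
}

summ <- dt[, {
  # Intermediate result, computed once per user and reused below.
  stats <- lapply(.SD, top_share)
  # Flattening one level yields names like country.name, country.support.
  c(list(behavior = any(behavior)), unlist(stats, recursive = FALSE))
}, by = user, .SDcols = fields]
summ
```

Because the fields are passed through .SDcols, the same call covers however many columns fields lists.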
