Data table - apply the same function on several columns to create new data table columns
I am working with the data.table package. I have a data table that represents user actions on a website. Every user can visit the website and perform multiple actions on it. My original data table holds actions (every row is an action), and I want to aggregate this information into a new data table grouped by user visit (every visit has a unique ID).

Some fields are shared by all actions of the same visit, for example the user name, the user status, and the visit number. At least one of the actions of each visit contains this information, but not necessarily all of them. For each visit (that is, each group of actions with the same visit ID), I want to retrieve the value of each such field and store it in the corresponding row of the new visits data table. For example, if I have the following original data table:
VisitID  ActionNum  UserName  UserStatus  VisitNum  ActionType
aaaaaaa  1          John      Active      5         x
aaaaaaa  2                    Active                y
aaaaaaa  3          John                  5         z
bbbbbbb  1                    NonActive             w
bbbbbbb  2          Dan                   7         t
I want to get a visits data table like the following:
VisitID  UserName  UserStatus  VisitNum
aaaaaaa  John      Active      5
bbbbbbb  Dan       NonActive   7
I created a function that takes a subset of the data table (only the rows of one visit) and a field name. This function should be applied to several fields (UserName, UserStatus, VisitNum).
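getGeneralField <- function(visitDT, field) {
  # Extract the column named by `field` from the rows of one visit
  vec <- visitDT[, get(field)]
  # Return the first distinct non-blank value
  return(unique(vec[vec != ""])[1])
}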
The problem is that every attempt to apply this function to .SD with by = VisitID produces something different from what I intended. What is the best way to do it? I used != "" in order to skip blank cells.
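One possible way to make such a helper easier to use with .SD is to have it take a single column vector instead of a table plus a field name. The sketch below is only an illustration: the name firstNonBlank is hypothetical, and it assumes blank cells are stored as empty strings or NA.

```r
# Hypothetical helper (not from the original post): returns the first
# value of a vector that is neither NA nor an empty string.
# If every value is blank, the result is NA.
firstNonBlank <- function(vec) {
  keep <- !is.na(vec) & nzchar(as.character(vec))
  vec[keep][1]
}

firstNonBlank(c("", "John", "John"))
firstNonBlank(c(NA, 7L))
```

Because it works on a plain vector, a function of this shape can be passed directly to lapply(.SD, ...) inside a grouped data.table call.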
We specify the columns of interest in .SDcols, group by "VisitID", loop over the columns of .SD with lapply(.SD, ...), and take the first non-blank element of each:
dt[, lapply(.SD, function(x) x[nzchar(x)][1]), by = VisitID, .SDcols = 3:5]
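To try this end to end, here is a minimal reproducible sketch built from the example table in the question. It assumes that blank cells are empty strings and that all columns are stored as character (with numeric columns and NA values, nzchar would not filter the blanks). Selecting the columns by name rather than by position 3:5 is a choice made here for readability.

```r
library(data.table)

# Example data from the question; blank cells are assumed to be "".
dt <- data.table(
  VisitID    = c("aaaaaaa", "aaaaaaa", "aaaaaaa", "bbbbbbb", "bbbbbbb"),
  ActionNum  = c("1", "2", "3", "1", "2"),
  UserName   = c("John", "", "John", "", "Dan"),
  UserStatus = c("Active", "Active", "", "NonActive", ""),
  VisitNum   = c("5", "", "5", "", "7"),
  ActionType = c("x", "y", "z", "w", "t")
)

# Columns shared by all actions of a visit.
cols <- c("UserName", "UserStatus", "VisitNum")

# One row per visit: first non-blank value of each selected column.
visits <- dt[, lapply(.SD, function(x) x[nzchar(x)][1]),
             by = VisitID, .SDcols = cols]
print(visits)
```

With this input, visits has one row per VisitID and matches the desired table in the question.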