Difference between rows in R on dataframe grouped by column

I'm looking to get the difference in counts between consecutive versions within each app_name. My dataset looks like this: app_name, version_id, count, [difference]

Here is the dataset:

data = structure(list(app_name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), version_id = c(1, 
1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4), count = c(600L, 620L, 620L, 
200L, 200L, 250L, 250L, 15L, 36L)), .Names = c("app_name", "version_id", 
"count"), class = "data.frame", row.names = c(NA, -9L))

Given this data.frame, how can I get the lagged difference in count between consecutive version_id values within each app_name? The diff for the first version of each app should be zero, since there is no earlier version to compare against.

Here is an example of what the final result would look like with the added 'diff' column:

structure(list(app_name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), version_id = c(1, 
1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4), count = c(600L, 620L, 620L, 
200L, 200L, 250L, 250L, 15L, 36L), diff = c(0, 20, 0, 0, 0, 50, 
0, 0, 21)), .Names = c("app_name", "version_id", "count", "diff"
), class = "data.frame", row.names = c(NA, -9L))

Try using dplyr and lag:

library(dplyr)
data %>% group_by(app_name) %>%
         mutate(diffvers = version_id - dplyr::lag(version_id, default = version_id[1]),
                diffcount = count - dplyr::lag(count, default = count[1]))

Source: local data frame [9 x 5]
Groups: app_name [3]

  app_name version_id count diffvers diffcount
    (fctr)      (dbl) (int)    (dbl)     (int)
1        a        1.0   600      0.0         0
2        a        1.1   620      0.1        20
3        a        2.3   620      1.2         0
4        b        2.0   200      0.0         0
5        b        3.1   200      1.1         0
6        b        3.3   250      0.2        50
7        b        4.0   250      0.7         0
8        c        1.1    15      0.0         0
9        c        2.4    36      1.3        21
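
To see why the first row of each group comes out as zero, it may help to look at lag on a single group in isolation. The sketch below uses the counts of app "b" from the question; the names x, lagged and diffs are illustrative only and not part of the answer above.

```r
library(dplyr)

# Counts for app "b", in version order (taken from the question's data).
x <- c(200L, 200L, 250L, 250L)

# lag() shifts the vector one position to the right. The first slot has no
# predecessor, so it is filled with `default`; here that is x[1] itself.
lagged <- dplyr::lag(x, default = x[1])
lagged
# 200 200 200 250

# Subtracting the lagged vector gives the row-to-row change; the first
# element is x[1] - x[1], which is 0.
diffs <- x - lagged
diffs
# 0 0 50 0
```

Without the default argument, the first element of the lagged vector would be NA, and so would the first difference in every group.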

We could use data.table. We convert the 'data.frame' to a 'data.table' with setDT(data); then, grouped by 'app_name', we loop with lapply() over the columns specified in .SDcols, take the difference between each element and its lag (shift() uses type='lag' by default), and assign the output with := to create the new columns.

library(data.table)#v1.9.6
setDT(data)[, c('diffvers', 'diffcount') := lapply(.SD, 
              function(x) x-shift(x, fill=x[1L])), by = app_name, .SDcols=2:3]

data
#   app_name version_id count diffvers diffcount
#1:        a        1.0   600      0.0         0
#2:        a        1.1   620      0.1        20
#3:        a        2.3   620      1.2         0
#4:        b        2.0   200      0.0         0
#5:        b        3.1   200      1.1         0
#6:        b        3.3   250      0.2        50
#7:        b        4.0   250      0.7         0
#8:        c        1.1    15      0.0         0
#9:        c        2.4    36      1.3        21
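
For comparison, the same count difference can be sketched in base R without extra packages. This is not taken from either answer: it swaps in ave() and diff(), rebuilds the question's data with app_name as a plain character column, and adds an explicit sort so that "previous row" means "previous version" (the answers above rely on the rows already being in that order).

```r
# Rebuild the question's data (app_name kept as character for simplicity).
data <- data.frame(
  app_name   = c("a", "a", "a", "b", "b", "b", "b", "c", "c"),
  version_id = c(1, 1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4),
  count      = c(600L, 620L, 620L, 200L, 200L, 250L, 250L, 15L, 36L)
)

# Make sure rows are ordered by version within each app.
data <- data[order(data$app_name, data$version_id), ]

# ave() applies FUN to the counts of each app_name separately.
# The first row of each group gets 0; later rows get count minus previous count.
data$diffcount <- ave(data$count, data$app_name,
                      FUN = function(x) c(0L, diff(x)))

data$diffcount
# 0 20 0 0 0 50 0 0 21
```

The same call with data$version_id in place of data$count (and 0 instead of 0L) would give the version differences.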
