Data.table: Apply function over groups with reference to set value in each group. Pass resulting columns into a function

I have data in long format that is grouped by geography. Within each group I want to calculate the difference between one variable of interest (the base variable) and each of the other variables. I could not work out how to do this efficiently in a single data.table statement, so I used a workaround, which introduced some new errors along the way. I fixed those with more workarounds, but help with them would also be appreciated.

I then want to pass the resulting columns into a ggplot function. I cannot get the recommended methods to work, so I am using a deprecated one.

library(data.table)
library(ggplot2)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16) )

> results[c(1:4,13:16)]
   geography variable   statistic
1:         1    alpha -0.62645381
2:         1    bravo  0.18364332
3:         1  charlie -0.83562861
4:         1    delta  1.59528080
5:         4    alpha -0.62124058
6:         4    bravo -2.21469989
7:         4  charlie  1.12493092
8:         4    delta -0.04493361

base_variable <- "alpha"

From this point I would ideally write a simple piece of code that groups by geography and returns the table in the same format, with the statistic for each variable replaced by (base_variable - variable) within each group.

I could not figure out how to do this, so my workaround is below. Any advice on a better method is appreciated.

# Convert to a wide table so we can do the subtraction by rows
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

   geography      alpha      bravo    charlie       delta
1:         1 -0.6264538  0.1836433 -0.8356286  1.59528080
2:         2  0.3295078 -0.8204684  0.4874291  0.73832471
3:         3  0.5757814 -0.3053884  1.5117812  0.38984324
4:         4 -0.6212406 -2.2146999  1.1249309 -0.04493361

this_is_a_hack <- as.data.table(lapply(results_wide[,-1], function(x) results_wide[, ..base_variable] - x))

   alpha.alpha bravo.alpha charlie.alpha delta.alpha
1:           0  -0.8100971     0.2091748  -2.2217346
2:           0   1.1499762    -0.1579213  -0.4088169
3:           0   0.8811697    -0.9359998   0.1859381
4:           0   1.5934593    -1.7461715  -0.5763070

The names are now messed up and the geography column is gone. Why are the names like this? I also need to re-add geography.

this_is_a_hack[, geography := results_wide[, geography] ]

normalise_these_names <- colnames(this_is_a_hack)
#Regex approach. Hacky and situational. 
new_names <- sub("\\.(.*)", "", normalise_these_names[normalise_these_names != "geography"] )
normalise_these_names[normalise_these_names != "geography"] <- new_names
#Makes use of the fact that geographies will appear last in the data.table, not generalisable approach.
colnames(this_is_a_hack) <- normalise_these_names 
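
On the naming question: results_wide[, ..base_variable] returns a one-column data.table rather than a vector, so each element produced by lapply() is itself a table with a column called alpha, and as.data.table() appears to paste the list name and the inner column name together (hence bravo.alpha). A minimal sketch that sidesteps this, assuming the same example data, extracts the base column as a plain vector with [[ and keeps geography in the result. The names base_values, value_cols and diffs are illustrative only.

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

# `[[` returns the base column as a plain numeric vector, not a data.table
base_values <- results_wide[[base_variable]]

# Every column except the ID column takes part in the subtraction
value_cols <- setdiff(names(results_wide), "geography")

# Keep geography and subtract each value column from the base vector;
# lapply(.SD, ...) preserves the original column names
diffs <- results_wide[, c(list(geography = geography),
                          lapply(.SD, function(x) base_values - x)),
                      .SDcols = value_cols]
names(diffs)
```

With this shape there should be no need for the regex renaming or for re-adding geography afterwards.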

I don't need the base variable anymore because all of its values are zero, so I try to drop it. However, the way I usually do this does not work:

this_is_a_hack[, ..base_variable := NULL] 
Warning message:
In `[.data.table`(this_is_a_hack, , `:=`(..base_variable, NULL)) :
  Column '..base_variable' does not exist to remove

library(dplyr)
this_is_a_hack <- select(this_is_a_hack, -base_variable)
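
The .. prefix is meant for selecting columns in j, not for the left-hand side of :=. A sketch of the usual data.table idiom for dropping a column whose name is stored in a variable, shown on a small made-up table (dt and its values are illustrative, not the question's data):

```r
library(data.table)

# Toy table standing in for this_is_a_hack
dt <- data.table(geography = 1:2, alpha = c(0, 0), bravo = c(-0.81, 1.15))
base_variable <- "alpha"

# Wrapping the name in parentheses tells data.table to evaluate
# base_variable and use its value ("alpha") as the column to remove
dt[, (base_variable) := NULL]
names(dt)
```

This removes the column by reference, so no assignment and no dplyr dependency is needed.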

final_result <- melt(this_is_a_hack, id.vars = "geography")

> final_result[c(1:4,9:12)]
   geography variable      value
1:         1    bravo -0.8100971
2:         2    bravo  1.1499762
3:         3    bravo  0.8811697
4:         4    bravo  1.5934593
5:         1    delta -2.2217346
6:         2    delta -0.4088169
7:         3    delta  0.1859381
8:         4    delta -0.5763070

The data is now ready to be visualised. I'm trying to pass these variables into a plotting function, but referencing data.table columns seems harder than with data frames. Apparently quosures are the recommended way to pass column names into functions, but that just errored out, so I'm using the deprecated 'aes_string' function instead. Help on this is also appreciated.

plott <- function(dataset, varx, vary, fillby) {
  # varx <- ensym(varx)
  # vary <- ensym(vary)
  # fillby <- ensym(fillby)
  ggplot(dataset, 
         aes_string(x = varx, y = vary, color = fillby)) + 
    geom_point()
}

plott(dataset = final_result,
      varx = "geography",
      vary = "value",
      fillby = "variable")

# Error I get when I try the ensym(...) method in the function:
Don't know how to automatically pick scale for object of type name. Defaulting to continuous. (this message happens 3 times)
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = varx, y = vary, colour = fillby. 
Did you mistype the name of a data column or forget to add stat()?
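
The ensym() attempt likely failed because the captured symbols were never unquoted with !! inside aes(), so ggplot looked for columns literally named varx, vary and fillby. Since the column names are passed as strings anyway, one alternative is the .data pronoun, which ggplot2 supports inside aes(). The sketch below assumes ggplot2 3.0.0 or later; plot_points and the toy table are illustrative names and values, not part of the question.

```r
library(data.table)
library(ggplot2)

# Column names arrive as strings and are looked up through the .data pronoun
plot_points <- function(dataset, varx, vary, fillby) {
  ggplot(dataset,
         aes(x = .data[[varx]], y = .data[[vary]], colour = .data[[fillby]])) +
    geom_point() +
    # Set the labels explicitly so they show the column names
    labs(x = varx, y = vary, colour = fillby)
}

# Toy data with the same column layout as final_result
toy <- data.table(geography = rep(1:2, 2),
                  variable = rep(c("bravo", "delta"), each = 2),
                  value = c(-0.8, 1.1, -2.2, -0.4))

p <- plot_points(toy, varx = "geography", vary = "value", fillby = "variable")
```

Nothing here is specific to data.table: the same function should accept a plain data frame.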

One option is to group by 'geography' and, within each group, subtract the 'statistic' of the row where 'variable' equals 'base_variable' from every 'statistic'. The base rows are then filtered out.

results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
       by = geography][variable != base_variable]
#     geography variable       diff
#  1:         1    bravo  0.8100971
#  2:         1  charlie -0.2091748
#  3:         1    delta  2.2217346
#  4:         2    bravo -1.1499762
#  5:         2  charlie  0.1579213
#  6:         2    delta  0.4088169
#  7:         3    bravo -0.8811697
#  8:         3  charlie  0.9359998
#  9:         3    delta -0.1859381
# 10:         4    bravo -1.5934593
# 11:         4  charlie  1.7461715
# 12:         4    delta  0.5763070
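
The output above is (variable - base_variable), which is the opposite sign of the (base_variable - variable) convention used in the question. A sketch of the same grouped statement with the operands swapped, assuming exactly one base row per geography (the column name value and the object final_result are chosen to mirror the question's melted table):

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"

# Within each geography, subtract every statistic from the base statistic,
# then drop the base rows, whose difference is always zero
final_result <- results[, .(variable,
                            value = statistic[variable == base_variable] - statistic),
                        by = geography][variable != base_variable]

final_result[geography == 1]
```

If a geography has no base row, or more than one, the subtraction would fail or recycle, so the one-base-row assumption is worth checking on real data.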

This kind of thing can also be done with joins. In my experience the "subset variables + grouping" approach is usually faster for smaller tables (like this example), and the join approach is faster when you have millions of rows.

results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][]

#     geography variable   statistic       diff
#  1:         1    bravo  0.18364332  0.8100971
#  2:         1  charlie -0.83562861 -0.2091748
#  3:         1    delta  1.59528080  2.2217346
#  4:         2    bravo -0.82046838 -1.1499762
#  5:         2  charlie  0.48742905  0.1579213
#  6:         2    delta  0.73832471  0.4088169
#  7:         3    bravo -0.30538839 -0.8811697
#  8:         3  charlie  1.51178117  0.9359998
#  9:         3    delta  0.38984324 -0.1859381
# 10:         4    bravo -2.21469989 -1.5934593
# 11:         4  charlie  1.12493092  1.7461715
# 12:         4    delta -0.04493361  0.5763070
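
In the join, the i. prefix refers to a column of the table supplied in i (here the base rows). Note that the chained statement adds diff to a temporary subset, not to results itself. A sketch of a variant that updates results by reference, assuming one base row per geography; base_rows and base_stat are illustrative names:

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"

# One row per geography holding the base statistic under a distinct name
base_rows <- results[variable == base_variable,
                     .(geography, base_stat = statistic)]

# Update join: for each matching geography, i.base_stat is the value
# taken from base_rows; diff is added to results in place
results[base_rows, on = "geography", diff := statistic - i.base_stat]

results[geography == 1]
```

The base rows stay in the table with a diff of zero and can be filtered out afterwards with results[variable != base_variable].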

Two benchmarks: first on the 16-row example, then on the same table repeated 10,000 times (160,000 rows).

library(microbenchmark)
microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)

# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval cld
#  use_group 1.624204 1.801434 2.143670 2.212306 2.391793 2.654357    10  a 
#   use_join 6.297842 6.808610 7.626004 7.729634 8.337635 8.708916    10   b

results <- results[rep(1:.N, 1e4)][, geography := rleid(geography)]

microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)


# Unit: milliseconds
#       expr      min        lq      mean    median        uq      max neval cld
#  use_group 97.42187 106.80935 132.42537 120.64893 143.03045 208.1996    10   b
#   use_join 19.88511  21.86214  26.22012  25.82972  29.29885  36.0853    10  a 
