繁体   English   中英

Data.table:在组中引用函数,并参考每个组中的设置值。 将结果列传递给函数

[英]Data.table: Apply function over groups with reference to set value in each group. Pass resulting columns into a function

我有长格式的数据,按地理位置分组。 我想计算每个感兴趣的变量与所有其他感兴趣的变量之间的差异。 我无法弄清楚如何在单个数据表语句中有效地做到这一点,所以做了一个解决方法,在此过程中也引入了一些新的错误(我修复了那些有更多变通方法,但在这里帮助也会受到赞赏!)。

然后我想将结果列传递给ggplot函数但是无法获得推荐的方法,所以使用不推荐的方法。

library(data.table)
library(ggplot2)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16) )

> results[c(1:4,13:16)]
   geography variable   statistic
1:         1    alpha -0.62645381
2:         1    bravo  0.18364332
3:         1  charlie -0.83562861
4:         1    delta  1.59528080
5:         4    alpha -0.62124058
6:         4    bravo -2.21469989
7:         4  charlie  1.12493092
8:         4    delta -0.04493361

base_variable <- "alpha"

从这一点来说,我理想地希望编写一个按地理位置分组的简单代码,然后以相同的格式返回此表,但每个变量的统计量为(base_variable - variable)。

我无法弄清楚如何做到这一点所以我的解决方法如下,任何关于更好方法的建议都值得赞赏。

# Convert to a wide table so we can do the subtraction by rows
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

   geography      alpha      bravo    charlie       delta
1:         1 -0.6264538  0.1836433 -0.8356286  1.59528080
2:         2  0.3295078 -0.8204684  0.4874291  0.73832471
3:         3  0.5757814 -0.3053884  1.5117812  0.38984324
4:         4 -0.6212406 -2.2146999  1.1249309 -0.04493361

this_is_a_hack <- as.data.table(lapply(results_wide[,-1], function(x) results_wide[, ..base_variable] - x))

   alpha.alpha bravo.alpha charlie.alpha delta.alpha
1:           0  -0.8100971     0.2091748  -2.2217346
2:           0   1.1499762    -0.1579213  -0.4088169
3:           0   0.8811697    -0.9359998   0.1859381
4:           0   1.5934593    -1.7461715  -0.5763070

名字现在搞砸了,我们没有地理位置。 为什么这样的名字? 此外,需要重新添加地理位置。

this_is_a_hack[, geography := results_wide[, geography] ]

normalise_these_names <- colnames(this_is_a_hack)
#Regex approach. Hacky and situational. 
new_names <- sub("\\.(.*)", "", normalise_these_names[normalise_these_names != "geography"] )
normalise_these_names[normalise_these_names != "geography"] <- new_names
#Makes use of the fact that geographies will appear last in the data.table, not generalisable approach.
colnames(this_is_a_hack) <- normalise_these_names 

我不再需要基本变量,因为所有的值都是零,所以我试着放弃它但是我似乎不能按照通常的方式做到这一点:

this_is_a_hack[, ..base_variable := NULL] 
Warning message:
In `[.data.table`(this_is_a_hack, , `:=`(..base_variable, NULL)) :
  Column '..base_variable' does not exist to remove

library(dplyr)
this_is_a_hack <- select(this_is_a_hack, -base_variable)

final_result <- melt(this_is_a_hack, id.vars = "geography")

> final_result[c(1:4,9:12)]
   geography variable      value
1:         1    bravo -0.8100971
2:         2    bravo  1.1499762
3:         3    bravo  0.8811697
4:         4    bravo  1.5934593
5:         1    delta -2.2217346
6:         2    delta -0.4088169
7:         3    delta  0.1859381
8:         4    delta -0.5763070

现在可以将数据可视化。 我试图将这些变量传递给绘图函数,但是引用data.table列似乎比数据帧更难。 显然你应该使用quosures将data.table变量传递给函数,但这只是错误的,所以我使用了弃用的'aes_string'函数 - 对此的帮助也很感激。

plott <- function(dataset, varx, vary, fillby) {
  # varx <- ensym(varx)
  # vary <- ensym(vary)
  # vary <- ensym(fillby)
  ggplot(dataset, 
         aes_string(x = varx, y = vary, color = fillby)) + 
    geom_point()
}

plott(dataset = final_result,
      varx = "geography",
      vary = "value",
      fillby = "variable")

# Error I get when I try the ensym(...) method in the function:
Don't know how to automatically pick scale for object of type name. Defaulting to continuous. (this message happens 3 times)
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = varx, y = vary, colour = fillby. 
Did you mistype the name of a data column or forget to add stat()?

一个选项是通过创建基于'变量'的逻辑条件和'base_variable'元素按'地理'分组来对'统计'进行子集化

results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
       by = geography][variable != base_variable]
# geography variable       diff
# 1:         1    bravo  0.8100971
# 2:         1  charlie -0.2091748
# 3:         1    delta  2.2217346
# 4:         2    bravo -1.1499762
# 5:         2  charlie  0.1579213
# 6:         2    delta  0.4088169
# 7:         3    bravo -0.8811697
# 8:         3  charlie  0.9359998
# 9:         3    delta -0.1859381
#10:         4    bravo -1.5934593
#11:         4  charlie  1.7461715
#12:         4    delta  0.5763070

这种事情也可以用连接来完成。 根据我的经验,对于较小的表(如本示例),“子变量+分组”方法通常更快,并且当您有数百万行时,连接方法更快。

results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][]

#     geography variable   statistic       diff
#  1:         1    bravo  0.18364332  0.8100971
#  2:         1  charlie -0.83562861 -0.2091748
#  3:         1    delta  1.59528080  2.2217346
#  4:         2    bravo -0.82046838 -1.1499762
#  5:         2  charlie  0.48742905  0.1579213
#  6:         2    delta  0.73832471  0.4088169
#  7:         3    bravo -0.30538839 -0.8811697
#  8:         3  charlie  1.51178117  0.9359998
#  9:         3    delta  0.38984324 -0.1859381
# 10:         4    bravo -2.21469989 -1.5934593
# 11:         4  charlie  1.12493092  1.7461715
# 12:         4    delta -0.04493361  0.5763070

两个基准

library(microbenchmark)
microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)

# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval cld
#  use_group 1.624204 1.801434 2.143670 2.212306 2.391793 2.654357    10  a 
#   use_join 6.297842 6.808610 7.626004 7.729634 8.337635 8.708916    10   b

results <- results[rep(1:.N, 1e4)][, geography := rleid(geography)]

microbenchmark(
use_group = 
  results[, .(variable, diff = statistic - statistic[variable == base_variable]), 
           by = geography][variable != base_variable],
use_join = 
results[variable != base_variable
        ][results[variable == base_variable], on = 'geography',
          diff := statistic - i.statistic][],
times = 10
)


# Unit: milliseconds
#       expr      min        lq      mean    median        uq      max neval cld
#  use_group 97.42187 106.80935 132.42537 120.64893 143.03045 208.1996    10   b
#   use_join 19.88511  21.86214  26.22012  25.82972  29.29885  36.0853    10  a 

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM