Apply a function across groups and columns in data.table and/or dplyr
Data.table: Apply function over groups with reference to set value in each group. Pass resulting columns into a function
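I have data in long format, grouped by geography. I want to calculate the difference between a chosen base variable and each of the other variables of interest. I could not work out how to do this efficiently in a single data.table statement, so I used a workaround that introduced some new errors along the way (I fixed those with further workarounds, but help there would be appreciated too!).
I then want to pass the resulting columns to a ggplot function, but I could not get the recommended approach to work, so I used a deprecated one.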
library(data.table)
library(ggplot2)
set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
variable = rep(c("alpha", "bravo", "charlie", "delta"), 4),
statistic = rnorm(16) )
> results[c(1:4,13:16)]
geography variable statistic
1: 1 alpha -0.62645381
2: 1 bravo 0.18364332
3: 1 charlie -0.83562861
4: 1 delta 1.59528080
5: 4 alpha -0.62124058
6: 4 bravo -2.21469989
7: 4 charlie 1.12493092
8: 4 delta -0.04493361
base_variable <- "alpha"
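From this point, I would ideally like to write simple code that groups by geography and returns this table in the same format, but with each variable's statistic replaced by (base_variable - variable).
I could not work out how to do this, so my workaround is below. Any advice on a better approach is appreciated.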
# Convert to a wide table so we can do the subtraction by rows
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")
geography alpha bravo charlie delta
1: 1 -0.6264538 0.1836433 -0.8356286 1.59528080
2: 2 0.3295078 -0.8204684 0.4874291 0.73832471
3: 3 0.5757814 -0.3053884 1.5117812 0.38984324
4: 4 -0.6212406 -2.2146999 1.1249309 -0.04493361
this_is_a_hack <- as.data.table(lapply(results_wide[,-1], function(x) results_wide[, ..base_variable] - x))
alpha.alpha bravo.alpha charlie.alpha delta.alpha
1: 0 -0.8100971 0.2091748 -2.2217346
2: 0 1.1499762 -0.1579213 -0.4088169
3: 0 0.8811697 -0.9359998 0.1859381
4: 0 1.5934593 -1.7461715 -0.5763070
The names are now messed up, and we have lost geography. Why are the names like this? Geography also needs to be added back.
this_is_a_hack[, geography := results_wide[, geography] ]
normalise_these_names <- colnames(this_is_a_hack)
#Regex approach. Hacky and situational.
new_names <- sub("\\.(.*)", "", normalise_these_names[normalise_these_names != "geography"] )
normalise_these_names[normalise_these_names != "geography"] <- new_names
#Makes use of the fact that geographies will appear last in the data.table, not generalisable approach.
colnames(this_is_a_hack) <- normalise_these_names
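On the naming question above: a likely explanation is that `results_wide[, ..base_variable]` returns a one-column table rather than a vector, so each element produced by `lapply` is itself a table with a column called `alpha`, and `as.data.table` then builds names of the form `outer.inner`. The following is a minimal sketch, not the author's method, that extracts the base column as a plain vector with `[[` and keeps geography in the same call. The names `value_cols`, `base_col` and `diffs_wide` are hypothetical and introduced only for illustration.

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable  = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"
results_wide <- dcast(results, geography ~ variable, value.var = "statistic")

# Columns to transform: everything except the grouping column
value_cols <- setdiff(names(results_wide), "geography")

# `[[` returns a plain numeric vector, so no nested column names are created
base_col <- results_wide[[base_variable]]

# Keep geography and compute (base - x) for every value column in one step
diffs_wide <- results_wide[, c(list(geography = geography),
                               lapply(.SD, function(x) base_col - x)),
                           .SDcols = value_cols]
print(names(diffs_wide))
```

With this shape the regex renaming and the separate step that re-adds geography should not be needed.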
I no longer need the base variable, because all of its values are zero, so I tried to drop it, but I cannot seem to do that in the usual way:
this_is_a_hack[, ..base_variable := NULL]
Warning message:
In `[.data.table`(this_is_a_hack, , `:=`(..base_variable, NULL)) :
Column '..base_variable' does not exist to remove
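The warning arises because the `..` prefix is meant for selecting columns in `j`, not for the left-hand side of `:=`, where `..base_variable` is taken as a literal column name. A small sketch of the usual data.table idiom follows: wrapping the variable in parentheses on the left of `:=` makes data.table evaluate it to the column name it holds. The toy table `dt` is made up for illustration.

```r
library(data.table)

# Hypothetical toy table standing in for this_is_a_hack
dt <- data.table(geography = 1:2, alpha = c(0, 0), bravo = c(-0.81, 1.15))
base_variable <- "alpha"

# Parentheses force evaluation of base_variable, so the column "alpha" is removed
dt[, (base_variable) := NULL]
print(names(dt))
```

This removes the column by reference, so no dplyr call is required for this step.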
library(dplyr)
this_is_a_hack <- select(this_is_a_hack, -base_variable)
final_result <- melt(this_is_a_hack, id.vars = "geography")
> final_result[c(1:4,9:12)]
geography variable value
1: 1 bravo -0.8100971
2: 2 bravo 1.1499762
3: 3 bravo 0.8811697
4: 4 bravo 1.5934593
5: 1 delta -2.2217346
6: 2 delta -0.4088169
7: 3 delta 0.1859381
8: 4 delta -0.5763070
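Now the data can be visualised. I tried to pass these variables to a plotting function, but referencing data.table columns seems harder than referencing data frame columns. Apparently you should use quosures to pass data.table variables into a function, but that just errored, so I used the deprecated 'aes_string' function. Help with this is appreciated too.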
plott <- function(dataset, varx, vary, fillby) {
# varx <- ensym(varx)
# vary <- ensym(vary)
# vary <- ensym(fillby)
ggplot(dataset,
aes_string(x = varx, y = vary, color = fillby)) +
geom_point()
}
plott(dataset = final_result,
varx = "geography",
vary = "value",
fillby = "variable")
# Error I get when I try the ensym(...) method in the function:
Don't know how to automatically pick scale for object of type name. Defaulting to continuous. (this message happens 3 times)
Error: Aesthetics must be valid data columns. Problematic aesthetic(s): x = varx, y = vary, colour = fillby.
Did you mistype the name of a data column or forget to add stat()?
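One plausible reading of the error above is that the symbols created by `ensym()` reached the aesthetics without being unquoted with `!!`, so ggplot2 looked for columns literally named `varx`, `vary` and `fillby`. When the column names are passed as strings, as in the call above, an alternative to `aes_string` is the `.data` pronoun inside `aes()`. The sketch below assumes a reasonably recent ggplot2; `plot_points` and `demo` are hypothetical names with made-up values, not part of the original post.

```r
library(ggplot2)

# Column names arrive as strings and are looked up through the .data pronoun
plot_points <- function(dataset, varx, vary, fillby) {
  ggplot(dataset,
         aes(x = .data[[varx]], y = .data[[vary]], color = .data[[fillby]])) +
    geom_point()
}

# Small illustrative dataset with the same column names as final_result
demo <- data.frame(geography = rep(1:4, 2),
                   variable  = rep(c("bravo", "delta"), each = 4),
                   value     = c(-0.81, 1.15, 0.88, 1.59, -2.22, -0.41, 0.19, -0.58))

p <- plot_points(demo, varx = "geography", vary = "value", fillby = "variable")
```

The same function should accept a data.table, since ggplot2 treats it as a data frame.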
One option is to subset 'statistic' with a logical condition that matches 'variable' against 'base_variable', grouped by 'geography':
results[, .(variable, diff = statistic - statistic[variable == base_variable]),
by = geography][variable != base_variable]
# geography variable diff
# 1: 1 bravo 0.8100971
# 2: 1 charlie -0.2091748
# 3: 1 delta 2.2217346
# 4: 2 bravo -1.1499762
# 5: 2 charlie 0.1579213
# 6: 2 delta 0.4088169
# 7: 3 bravo -0.8811697
# 8: 3 charlie 0.9359998
# 9: 3 delta -0.1859381
#10: 4 bravo -1.5934593
#11: 4 charlie 1.7461715
#12: 4 delta 0.5763070
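Note that the output above is statistic minus the base value, which is the opposite sign of the (base_variable - variable) requested in the question. If the question's sign convention is wanted, a minimal variation is to swap the operands; `flipped` is a hypothetical name used only here.

```r
library(data.table)

set.seed(1)
results <- data.table(geography = rep(1:4, each = 4),
                      variable  = rep(c("alpha", "bravo", "charlie", "delta"), 4),
                      statistic = rnorm(16))
base_variable <- "alpha"

# Base value minus each statistic within a geography, then drop the base rows
flipped <- results[, .(variable,
                       diff = statistic[variable == base_variable] - statistic),
                   by = geography][variable != base_variable]
print(flipped)
```

For geography 1, bravo this gives about -0.81, which agrees with the question's final_result.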
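This kind of thing can also be done with a join. In my experience, the subset-plus-grouping approach is usually faster for smaller tables (such as this example), while the join approach is faster when you have millions of rows.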
results[variable != base_variable
][results[variable == base_variable], on = 'geography',
diff := statistic - i.statistic][]
# geography variable statistic diff
# 1: 1 bravo 0.18364332 0.8100971
# 2: 1 charlie -0.83562861 -0.2091748
# 3: 1 delta 1.59528080 2.2217346
# 4: 2 bravo -0.82046838 -1.1499762
# 5: 2 charlie 0.48742905 0.1579213
# 6: 2 delta 0.73832471 0.4088169
# 7: 3 bravo -0.30538839 -0.8811697
# 8: 3 charlie 1.51178117 0.9359998
# 9: 3 delta 0.38984324 -0.1859381
# 10: 4 bravo -2.21469989 -1.5934593
# 11: 4 charlie 1.12493092 1.7461715
# 12: 4 delta -0.04493361 0.5763070
Two benchmarks, first on the 16-row example and then on the same table replicated 10,000 times:
library(microbenchmark)
microbenchmark(
use_group =
results[, .(variable, diff = statistic - statistic[variable == base_variable]),
by = geography][variable != base_variable],
use_join =
results[variable != base_variable
][results[variable == base_variable], on = 'geography',
diff := statistic - i.statistic][],
times = 10
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# use_group 1.624204 1.801434 2.143670 2.212306 2.391793 2.654357 10 a
# use_join 6.297842 6.808610 7.626004 7.729634 8.337635 8.708916 10 b
results <- results[rep(1:.N, 1e4)][, geography := rleid(geography)]
microbenchmark(
use_group =
results[, .(variable, diff = statistic - statistic[variable == base_variable]),
by = geography][variable != base_variable],
use_join =
results[variable != base_variable
][results[variable == base_variable], on = 'geography',
diff := statistic - i.statistic][],
times = 10
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# use_group 97.42187 106.80935 132.42537 120.64893 143.03045 208.1996 10 b
# use_join 19.88511 21.86214 26.22012 25.82972 29.29885 36.0853 10 a