![](/img/trans.png)
[英]Filter in group_by + mutate not working as in group_by + summarise in dplyr R
[英]Debugging a group_by/summarise/do code in R
我经常使用dplyr软件包和group_by / summarise / do函数。 我经常拥有大量数据集,并且需要2到3个小时来计算我的函数(也许可以对其进行优化,但这不是主题)。
碰巧的是,经过1.5个小时的计算,我的do函数给出了一个错误,因为我忘记了考虑代码中的一种特定情况。 唯一的问题是我不知道哪个迭代会产生此错误,并且通常来说,我必须创建循环来替换我的group_by / summary / do代码,才能知道导致问题的数据是什么。
一个非常简单的示例来解释我的问题...原因通常,我使用一些复杂的,自己创建的功能,这些功能带有很多组。
require(dplyr)
FUN <- function(x) {
for (i in 1:which(!is.na(x$value))[1])
{
print("TEST")
}
}
df <- data.frame(ID = c(rep("ID1",10), rep("ID2", 20), rep("ID3", 5)),
value= c(sample(1:100, 10), rep(NA, 20), sample(0:50, 5)))
Result <- group_by(df, ID) %>%
do(Res=FUN(.))
在这里,第二个组(ID2组)将出现错误,因为所有值均为NA,FUN中的循环无法工作。 要知道我的问题来自ID2,我将执行以下操作:
for (j in 1:length(unique(df$ID)))
{
Interm <- df[df$ID==unique(df$ID)[j],]
Res <- FUN(Interm)
print(j)
}
由于这个原因,我知道我的问题来自j = 2,所以来自ID2。
这样的简单计算就可以了,但是对于我的函数来说确实需要很长时间。 例如,正确地知道我的group_by / do代码在45分钟后给出了错误,我做了两个循环的代码来知道给出错误的数据是什么,并且1.5小时后,它仍在运行...当我将找到错误后,我将仅在我的函数(FUN)中添加一行以考虑此特定情况,重新运行do代码,然后在1h之后可能会出现另一个错误...
一个简单的问题:有没有办法知道代码从group_by / do代码中给出了错误?
谢谢
Frank的答案到目前为止是最简单的,但这是我为中端管道调试等工作的代码样本。
买者自负:
dplyr
和相关软件包的任何作者或贡献者都没有祝福它,甚至没有对其进行评论; dplyr-0.7.4
,但是并没有利用许多应该使用的“优点”,例如rlang
和/或lazyeval
; 如果/发现任何迷惑之处,欢迎提交错误报告。
这可能包括您想要的任何东西:
mtcars %>%
group_by(cyl) %>%
pipe_message(whichcyl = cyl[1], bestmpg = max(mpg)) %>%
summarize(mpg=mean(mpg))
# Mid-pipe message (2018-05-01 09:39:26):
# $ :List of 2
# ..$ whichcyl: num 4
# ..$ bestmpg : num 33.9
# $ :List of 2
# ..$ whichcyl: num 6
# ..$ bestmpg : num 21.4
# $ :List of 2
# ..$ whichcyl: num 8
# ..$ bestmpg : num 19.2
# # A tibble: 3 x 2
# cyl mpg
# <dbl> <dbl>
# 1 4. 26.7
# 2 6. 19.7
# 3 8. 15.1
您可以选择只是了解正在发生的事情并快速查看数据,从而让您看到时机,然后退出管道:
mtcars %>%
group_by(cyl) %>%
pipe_assert(all(mpg > 12), .debug=TRUE) %>%
summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]>
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
# stop(.msg, call. = FALSE)
# } else {
# .x[.indices[[.ind]], ] <- x
# return(.x)
# }
# Browse[2]>
x
# # A tibble: 14 x 11
# # Groups: cyl [1]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 18.7 8. 360. 175. 3.15 3.44 17.0 0. 0. 3. 2.
# 2 14.3 8. 360. 245. 3.21 3.57 15.8 0. 0. 3. 4.
# 3 16.4 8. 276. 180. 3.07 4.07 17.4 0. 0. 3. 3.
# 4 17.3 8. 276. 180. 3.07 3.73 17.6 0. 0. 3. 3.
# 5 15.2 8. 276. 180. 3.07 3.78 18.0 0. 0. 3. 3.
# 6 10.4 8. 472. 205. 2.93 5.25 18.0 0. 0. 3. 4.
# 7 10.4 8. 460. 215. 3.00 5.42 17.8 0. 0. 3. 4.
# 8 14.7 8. 440. 230. 3.23 5.34 17.4 0. 0. 3. 4.
# 9 15.5 8. 318. 150. 2.76 3.52 16.9 0. 0. 3. 2.
# 10 15.2 8. 304. 150. 3.15 3.44 17.3 0. 0. 3. 2.
# 11 13.3 8. 350. 245. 3.73 3.84 15.4 0. 0. 3. 4.
# 12 19.2 8. 400. 175. 3.08 3.84 17.0 0. 0. 3. 2.
# 13 15.8 8. 351. 264. 4.22 3.17 14.5 0. 1. 5. 4.
# 14 15.0 8. 301. 335. 3.54 3.57 14.6 0. 1. 5. 8.
# Browse[2]>
c
# Error: all(mpg > 12) is not TRUE ... in Group: cyl:8
或者您可以选择更新/更改数据; 意识到这会修改管道中的数据,而不是源,所以实际上仅在开发和/或一次性修复中好:
mtcars %>%
group_by(cyl) %>%
pipe_assert(all(mpg > 12), .debug=TRUE) %>%
summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]>
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
# stop(.msg, call. = FALSE)
# } else {
# .x[.indices[[.ind]], ] <- x
# return(.x)
# }
(忽略当前的调试代码行, if ...
,那是我的工作,并不漂亮。)我现在在调试器中,我可以查看和更改/修复数据:
# Browse[2]>
x
# ...as before...
x$mpg <- x$mpg + 1000
如果数据被更改,管道将继续,否则它将stop
。
# Browse[2]>
c
# # A tibble: 3 x 2
# cyl mpg
# <dbl> <dbl>
# 1 4. 26.7
# 2 6. 19.7
# 3 8. 1015.
(可以更改数据,但不能更改标签...因此,如果我们完成x$cyl <- 99
,它在其余管道中仍将显示8
这是dplyr
不允许您更改分组变量的结果...这是一件好事,IMO。)
还有pipe_debug
总是调试,但效果pipe_debug
。 它也不会(当前)传递更改后的数据,因此请使用pipe_assert
(例如, pipe_assert(FALSE,.debug=TRUE)
)。
来源,也可以在我的要旨中找到 :
#' Mid-pipe assertions
#'
#' Test assertions mid-pipe. Each assertion is executed individually
#' on each group (if present) of the piped data. Any failures indicate
#' the group that caused the fail, terminating on the first failure.
#'
#' If `.debug`, then the interpreter enters the `browser()`, allowing
#' you to look at the specific data, stored as `x` (just the grouped
#' data if `is.grouped_df(.x)`, all data otherwise). If the data is
#' changed, then the altered data will be sent forward in the pipeline
#' (assuming you fixed the failed assertion), otherwise the assertion
#' will fail (as an assertion should).
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed expression(s), each must evaluate to a single
#' 'logical'; similar to [assertthat::assert_that()], rather than
#' combining expressions with `&&`, separate them by commas so that
#' better error messages can be generated.
#' @param .msg a custom error message to be printed if one of the
#' conditions is false.
#' @param .debug logical, whether to invoke [browser()] if the
#' assertion fails; if `TRUE`, then when the debugger begins on a
#' fail, the grouped data will be in the variable `x`
#' @return data.frame (unchanged)
#' @export
#' @import assertthat
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#' library(assertthat)
#'
#' mtcars %>%
#' group_by(cyl) %>%
#' pipe_assert(
#' all(cyl < 9),
#' all(mpg > 10)
#' ) %>%
#' count()
#' # # A tibble: 3 x 2
#' # cyl n
#' # <dbl> <int>
#' # 1 4 11
#' # 2 6 7
#' # 3 8 14
#'
#' # note here that the "4" group is processed first and does not fail
#' mtcars %>%
#' group_by(cyl, vs) %>%
#' pipe_assert( all(cyl < 6) ) %>%
#' count()
#' # Error: all(cyl < 6) is not TRUE ... in Group: cyl:6, vs:0
#'
#' }
pipe_assert <- function(.x, ..., .msg = NULL, .debug = FALSE) {
if (is.grouped_df(.x)) {
.indices <- lapply(attr(.x, "indices"), `+`, 1L)
.labels <- attr(.x, "labels")
} else {
.indices <- list(seq_len(nrow(.x)))
}
for (assertion in eval(substitute(alist(...)))) {
for (.ind in seq_along(.indices)) {
.out <- assertthat::see_if(eval(assertion, .x[.indices[[.ind]],]))
if (! .out) {
x <- .x[.indices[[.ind]],]
if (is.null(.msg)) .msg <- paste(deparse(assertion), "is not TRUE")
if (is.grouped_df(.x)) {
.msg <- paste(.msg,
paste("in Group:",
paste(sprintf("%s:%s", names(.labels),
sapply(.labels, function(z) as.character(z[.ind]))),
collapse = ", ")),
sep = " ... ")
}
if (.debug) {
message("#\n", paste("#", .msg), "\n# 'x' is the current data that failed the assertion.\n#\n")
browser()
}
if (identical(x, .x[.indices[[.ind]],])) {
stop(.msg, call. = FALSE)
} else {
.x[.indices[[.ind]],] <- x
return(.x)
}
}
}
}
.x # "unmodified"
}
#' Mid-pipe debugging
#'
#' Mid-pipe peek at the data, named `x` within [browser()], but
#' *changes are not preserved*.
#'
#' @param .x data.frame, potentially grouped
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#' group_by(cyl, vs) %>%
#' pipe_debug() %>%
#' count()
#'
#' }
pipe_debug <- function(.x) {
if (is.grouped_df(.x)) {
.indices <- lapply(attr(.x, "indices"), `+`, 1L)
.labels <- attr(.x, "labels")
} else {
.indices <- list(seq_len(nrow(.x)))
}
# I used 'lapply' here instead of a 'for' loop because
# browser-stepping after 'browser()' in a 'for' loop could continue
# through all of *this* code, not really meaningful; in pipe_assert
# above, since the next call after 'browser()' is 'stop()', there's
# little risk of stepping in or out of this not-meaningful code
.ign <- lapply(seq_along(.indices), function(.ind, .x) {
x <- .x[.indices[[.ind]],]
message("#",
if (is.grouped_df(.x)) {
paste("\n# in Group:",
paste(sprintf("%s:%s", names(.labels),
sapply(.labels, function(z) as.character(z[.ind]))),
collapse = ", "),
"\n")
},
"# 'x' is the current data (grouped, if appropriate).\n#\n")
browser()
NULL
}, .x = .x)
.x # "unmodified"
}
#' Mid-pipe status messaging.
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed or named expression(s) whose outputs will be
#' captured, aggregated with [utils::str()], and displayed as a
#' [base::message()]; if present, a '.' literal is replace with a
#' reference to the `data.frame` (in its entirety, not grouped)
#' @param .FUN function, typically [message()] or [warning()] (for
#' when messages are suppressed); note: if set to `warning`, the
#' argument `call.=FALSE` is appended to the arguments
#' @param .timestamp logical, if 'TRUE' then a POSIXct timestamp is
#' appended to the header of the `str`-like output (default 'TRUE')
#' @param .stropts optional list of options to pass to [utils::str()],
#' for example `list(max.level=1)`
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#' pipe_message( # unnamed
#' "starting",
#' group_size(.)
#' ) %>%
#' group_by(cyl) %>%
#' pipe_message( # named
#' msg = "grouped",
#' grps = group_size(.)
#' ) %>%
#' count() %>%
#' ungroup() %>%
#' pipe_message( # alternate function, for emphasis!
#' msg = "done",
#' .FUN = warning
#' )
#'
#' head(mtcars) %>%
#' pipe_message(
#' list(a = list(aa=1, bb=2, cc=3))
#' )
#' head(mtcars) %>%
#' pipe_message(
#' list(a = list(aa=1, bb=2, cc=3)),
#' .stropts = list(max.level = 2)
#' )
#'
#' }
pipe_message <- function(.x, ..., .FUN = message, .timestamp = TRUE, .stropts = NULL) {
.expressions <- eval(substitute(alist(...)))
if (is.grouped_df(.x)) {
.indices <- lapply(attr(.x, "indices"), `+`, 1L)
.labels <- attr(.x, "labels")
} else {
.indices <- list(seq_len(nrow(.x)))
.labels <- ""
}
lst <- mapply(function(.ind, .lbl) {
.x <- .x[.ind,,drop=FALSE]
lapply(.expressions, function(.expr) {
if (is.call(.expr)) .expr <- as.call(lapply(.expr, function(a) if (a == ".") as.symbol(".x") else a))
eval(.expr, .x)
})
}, .indices, .labels, SIMPLIFY=FALSE)
.out <- capture.output(
do.call("str", c(list(lst), .stropts))
)
.out[1] <- sprintf("Mid-pipe message%s:",
if (.timestamp) paste(" (", Sys.time(), ")", sep = ""))
do.call(.FUN, c(list(paste(.out, collapse = "\n")),
if (identical(.FUN, warning)) list(call. = FALSE)))
.x # "unmodified"
}
您仍然可以在此处进行打印:
df %>% group_by(ID) %>% do({
the_id = unique(.$ID)
cat("Working on...", the_id, "which is...", match(the_id, unique(df$ID)), "/", n_distinct(df$ID), "\n")
FUN(.)
})
哪个打印
Working on... 1 which is... 1 / 3
[1] "TEST"
Working on... 2 which is... 2 / 3
Error in 1:which(!is.na(x$value))[1] : NA/NaN argument
我通常这样做(使用data.table而不是dplyr,但使用相同的想法)。 我意识到有更复杂的调试方法,但是对我来说它已经足够好了。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.