在R中调试group_by / summaryise / do代码

Question

我经常使用dplyr软件包和group_by / summarise / do函数。 我经常拥有大量数据集，并且需要2到3个小时来计算我的函数（也许可以对其进行优化，但这不是主题）。

碰巧的是，经过1.5个小时的计算，我的do函数给出了一个错误，因为我忘记了考虑代码中的一种特定情况。 唯一的问题是我不知道哪个迭代会产生此错误，并且通常来说，我必须创建循环来替换我的group_by / summary / do代码，才能知道导致问题的数据是什么。

一个非常简单的示例来解释我的问题...原因通常，我使用一些复杂的，自己创建的功能，这些功能带有很多组。

require(dplyr)

FUN <- function(x) {
  for (i in 1:which(!is.na(x$value))[1])
  {
    print("TEST")
  }
}

df <- data.frame(ID = c(rep("ID1",10), rep("ID2", 20), rep("ID3", 5)),
                 value= c(sample(1:100, 10), rep(NA, 20), sample(0:50, 5)))

Result <- group_by(df, ID) %>%
  do(Res=FUN(.))

在这里，第二个组（ID2组）将出现错误，因为所有值均为NA，FUN中的循环无法工作。 要知道我的问题来自ID2，我将执行以下操作：

for (j in 1:length(unique(df$ID)))
{
  Interm <- df[df$ID==unique(df$ID)[j],]
  Res <- FUN(Interm)
  print(j)
}

由于这个原因，我知道我的问题来自j = 2，所以来自ID2。

这样的简单计算就可以了，但是对于我的函数来说确实需要很长时间。 例如，正确地知道我的group_by / do代码在45分钟后给出了错误，我做了两个循环的代码来知道给出错误的数据是什么，并且1.5小时后，它仍在运行...当我将找到错误后，我将仅在我的函数（FUN）中添加一行以考虑此特定情况，重新运行do代码，然后在1h之后可能会出现另一个错误...

一个简单的问题：有没有办法知道代码从group_by / do代码中给出了错误？

谢谢

Answer 1

Frank的答案到目前为止是最简单的，但这是我为中端管道调试等工作的代码样本。

买者自负：

该代码未经测试；
即使经过充分测试，也无意将其用于生产或无人值守的使用；
dplyr和相关软件包的任何作者或贡献者都没有祝福它，甚至没有对其进行评论；
它当前可以在R-3.4和dplyr-0.7.4 ，但是并没有利用许多应该使用的“优点”，例如rlang和/或lazyeval ；
它适合我使用，未经您测试。

如果/发现任何迷惑之处，欢迎提交错误报告。

中管消息

这可能包括您想要的任何东西：

mtcars %>%
  group_by(cyl) %>%
  pipe_message(whichcyl = cyl[1], bestmpg = max(mpg)) %>%
  summarize(mpg=mean(mpg))
# Mid-pipe message (2018-05-01 09:39:26):
#  $ :List of 2
#   ..$ whichcyl: num 4
#   ..$ bestmpg : num 33.9
#  $ :List of 2
#   ..$ whichcyl: num 6
#   ..$ bestmpg : num 21.4
#  $ :List of 2
#   ..$ whichcyl: num 8
#   ..$ bestmpg : num 19.2
# # A tibble: 3 x 2
#     cyl   mpg
#   <dbl> <dbl>
# 1    4.  26.7
# 2    6.  19.7
# 3    8.  15.1

中管断言

您可以选择只是了解正在发生的事情并快速查看数据，从而让您看到时机，然后退出管道：

mtcars %>%
  group_by(cyl) %>%
  pipe_assert(all(mpg > 12), .debug=TRUE) %>%
  summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]> 
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
#     stop(.msg, call. = FALSE)
# } else {
#     .x[.indices[[.ind]], ] <- x
#     return(.x)
# }
# Browse[2]> 
x
# # A tibble: 14 x 11
# # Groups:   cyl [1]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  18.7    8.  360.  175.  3.15  3.44  17.0    0.    0.    3.    2.
#  2  14.3    8.  360.  245.  3.21  3.57  15.8    0.    0.    3.    4.
#  3  16.4    8.  276.  180.  3.07  4.07  17.4    0.    0.    3.    3.
#  4  17.3    8.  276.  180.  3.07  3.73  17.6    0.    0.    3.    3.
#  5  15.2    8.  276.  180.  3.07  3.78  18.0    0.    0.    3.    3.
#  6  10.4    8.  472.  205.  2.93  5.25  18.0    0.    0.    3.    4.
#  7  10.4    8.  460.  215.  3.00  5.42  17.8    0.    0.    3.    4.
#  8  14.7    8.  440.  230.  3.23  5.34  17.4    0.    0.    3.    4.
#  9  15.5    8.  318.  150.  2.76  3.52  16.9    0.    0.    3.    2.
# 10  15.2    8.  304.  150.  3.15  3.44  17.3    0.    0.    3.    2.
# 11  13.3    8.  350.  245.  3.73  3.84  15.4    0.    0.    3.    4.
# 12  19.2    8.  400.  175.  3.08  3.84  17.0    0.    0.    3.    2.
# 13  15.8    8.  351.  264.  4.22  3.17  14.5    0.    1.    5.    4.
# 14  15.0    8.  301.  335.  3.54  3.57  14.6    0.    1.    5.    8.
# Browse[2]> 
c
# Error: all(mpg > 12) is not TRUE ... in Group: cyl:8

或者您可以选择更新/更改数据； 意识到这会修改管道中的数据，而不是源，所以实际上仅在开发和/或一次性修复中好：

mtcars %>%
  group_by(cyl) %>%
  pipe_assert(all(mpg > 12), .debug=TRUE) %>%
  summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]> 
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
#     stop(.msg, call. = FALSE)
# } else {
#     .x[.indices[[.ind]], ] <- x
#     return(.x)
# }

（忽略当前的调试代码行， if ... ，那是我的工作，并不漂亮。）我现在在调试器中，我可以查看和更改/修复数据：

# Browse[2]> 
x
# ...as before...
x$mpg <- x$mpg + 1000

如果数据被更改，管道将继续，否则它将stop 。

# Browse[2]> 
c
# # A tibble: 3 x 2
#     cyl    mpg
#   <dbl>  <dbl>
# 1    4.   26.7
# 2    6.   19.7
# 3    8. 1015.

（可以更改数据，但不能更改标签...因此，如果我们完成x$cyl <- 99 ，它在其余管道中仍将显示8这是dplyr不允许您更改分组变量的结果...这是一件好事，IMO。）

还有pipe_debug总是调试，但效果pipe_debug 。 它也不会（当前）传递更改后的数据，因此请使用pipe_assert （例如， pipe_assert(FALSE,.debug=TRUE) ）。

来源，也可以在我的要旨中找到：

#' Mid-pipe assertions
#'
#' Test assertions mid-pipe. Each assertion is executed individually
#' on each group (if present) of the piped data. Any failures indicate
#' the group that caused the fail, terminating on the first failure.
#'
#' If `.debug`, then the interpreter enters the `browser()`, allowing
#' you to look at the specific data, stored as `x` (just the grouped
#' data if `is.grouped_df(.x)`, all data otherwise). If the data is
#' changed, then the altered data will be sent forward in the pipeline
#' (assuming you fixed the failed assertion), otherwise the assertion
#' will fail (as an assertion should).
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed expression(s), each must evaluate to a single
#'   'logical'; similar to [assertthat::assert_that()], rather than
#'   combining expressions with `&&`, separate them by commas so that
#'   better error messages can be generated.
#' @param .msg a custom error message to be printed if one of the
#'   conditions is false.
#' @param .debug logical, whether to invoke [browser()] if the
#'   assertion fails; if `TRUE`, then when the debugger begins on a
#'   fail, the grouped data will be in the variable `x`
#' @return data.frame (unchanged)
#' @export
#' @import assertthat
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#' library(assertthat)
#'
#' mtcars %>%
#'   group_by(cyl) %>%
#'   pipe_assert(
#'     all(cyl < 9),
#'     all(mpg > 10)
#'   ) %>%
#'   count()
#' # # A tibble: 3 x 2
#' #     cyl     n
#' #   <dbl> <int>
#' # 1     4    11
#' # 2     6     7
#' # 3     8    14
#' 
#' # note here that the "4" group is processed first and does not fail
#' mtcars %>%
#'   group_by(cyl, vs) %>%
#'   pipe_assert( all(cyl < 6) ) %>%
#'   count()
#' # Error: all(cyl < 6) is not TRUE ... in Group: cyl:6, vs:0
#'
#' }
pipe_assert <- function(.x, ..., .msg = NULL, .debug = FALSE) {
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
  }
  for (assertion in eval(substitute(alist(...)))) {
    for (.ind in seq_along(.indices)) {
      .out <- assertthat::see_if(eval(assertion, .x[.indices[[.ind]],]))
      if (! .out) {
        x <- .x[.indices[[.ind]],]
        if (is.null(.msg)) .msg <- paste(deparse(assertion), "is not TRUE")
        if (is.grouped_df(.x)) {
          .msg <- paste(.msg,
                        paste("in Group:",
                              paste(sprintf("%s:%s", names(.labels),
                                            sapply(.labels, function(z) as.character(z[.ind]))),
                                    collapse = ", ")),
                        sep = " ... ")
        }
        if (.debug) {
          message("#\n", paste("#", .msg), "\n# 'x' is the current data that failed the assertion.\n#\n")
          browser()
        }
        if (identical(x, .x[.indices[[.ind]],])) {
          stop(.msg, call. = FALSE)
        } else {
          .x[.indices[[.ind]],] <- x
          return(.x)
        }
      }
    }
  }
  .x # "unmodified"
}

#' Mid-pipe debugging
#'
#' Mid-pipe peek at the data, named `x` within [browser()], but
#' *changes are not preserved*.
#'
#' @param .x data.frame, potentially grouped
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#'   group_by(cyl, vs) %>%
#'   pipe_debug() %>%
#'   count()
#'
#' }
pipe_debug <- function(.x) {
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
  }
  # I used 'lapply' here instead of a 'for' loop because
  # browser-stepping after 'browser()' in a 'for' loop could continue
  # through all of *this* code, not really meaningful; in pipe_assert
  # above, since the next call after 'browser()' is 'stop()', there's
  # little risk of stepping in or out of this not-meaningful code
  .ign <- lapply(seq_along(.indices), function(.ind, .x) {
    x <- .x[.indices[[.ind]],]
    message("#",
            if (is.grouped_df(.x)) {
              paste("\n# in Group:",
                    paste(sprintf("%s:%s", names(.labels),
                                  sapply(.labels, function(z) as.character(z[.ind]))),
                          collapse = ", "),
                    "\n")
            },
            "# 'x' is the current data (grouped, if appropriate).\n#\n")
    browser()
    NULL
  }, .x = .x)
  .x # "unmodified"
}

#' Mid-pipe status messaging.
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed or named expression(s) whose outputs will be
#'   captured, aggregated with [utils::str()], and displayed as a
#'   [base::message()]; if present, a '.' literal is replace with a
#'   reference to the `data.frame` (in its entirety, not grouped)
#' @param .FUN function, typically [message()] or [warning()] (for
#'   when messages are suppressed); note: if set to `warning`, the
#'   argument `call.=FALSE` is appended to the arguments
#' @param .timestamp logical, if 'TRUE' then a POSIXct timestamp is
#'   appended to the header of the `str`-like output (default 'TRUE')
#' @param .stropts optional list of options to pass to [utils::str()],
#'   for example `list(max.level=1)`
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#'   pipe_message(           # unnamed
#'     "starting",
#'     group_size(.)
#'   ) %>%
#'   group_by(cyl) %>%
#'   pipe_message(           # named
#'     msg  = "grouped",
#'     grps = group_size(.)
#'   ) %>%
#'   count() %>%
#'   ungroup() %>%
#'   pipe_message(           # alternate function, for emphasis!
#'     msg = "done",
#'     .FUN = warning
#'   )
#'
#' head(mtcars) %>%
#'   pipe_message(
#'     list(a = list(aa=1, bb=2, cc=3))
#'   )
#' head(mtcars) %>%
#'   pipe_message(
#'     list(a = list(aa=1, bb=2, cc=3)),
#'     .stropts = list(max.level = 2)
#'   )
#'
#' }
pipe_message <- function(.x, ..., .FUN = message, .timestamp = TRUE, .stropts = NULL) {
  .expressions <- eval(substitute(alist(...)))
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
    .labels <- ""
  }
  lst <- mapply(function(.ind, .lbl) {
    .x <- .x[.ind,,drop=FALSE]
    lapply(.expressions, function(.expr) {
      if (is.call(.expr)) .expr <- as.call(lapply(.expr, function(a) if (a == ".") as.symbol(".x") else a))
      eval(.expr, .x)
    })
  }, .indices, .labels, SIMPLIFY=FALSE)
  .out <- capture.output(
    do.call("str", c(list(lst), .stropts))
  )
  .out[1] <- sprintf("Mid-pipe message%s:",
                     if (.timestamp) paste(" (", Sys.time(), ")", sep = ""))
  do.call(.FUN, c(list(paste(.out, collapse = "\n")),
                  if (identical(.FUN, warning)) list(call. = FALSE)))
  .x # "unmodified"
}

Answer 2

您仍然可以在此处进行打印：

df %>% group_by(ID) %>% do({
  the_id = unique(.$ID)
  cat("Working on...", the_id, "which is...", match(the_id, unique(df$ID)), "/", n_distinct(df$ID), "\n")
  FUN(.)
})

哪个打印

Working on... 1 which is... 1 / 3 
[1] "TEST"
Working on... 2 which is... 2 / 3 
Error in 1:which(!is.na(x$value))[1] : NA/NaN argument

我通常这样做（使用data.table而不是dplyr，但使用相同的想法）。 我意识到有更复杂的调试方法，但是对我来说它已经足够好了。

在R中调试group_by / summaryise / do代码

问题描述

2 个解决方案

解决方案1
2 2018-05-01 16:55:04

中管消息

中管断言

解决方案2
1 已采纳 2018-05-01 16:04:25

在R中调试group_by / summaryise / do代码

问题描述

2 个解决方案

解决方案1 2 2018-05-01 16:55:04

中管消息

中管断言

解决方案2 1 已采纳 2018-05-01 16:04:25

解决方案1
2 2018-05-01 16:55:04

解决方案2
1 已采纳 2018-05-01 16:04:25