簡體   English   中英

在R中調試group_by / summaryise / do代碼

[英]Debugging a group_by/summarise/do code in R

我經常使用dplyr軟件包和group_by / summarise / do函數。 我經常擁有大量數據集,並且需要2到3個小時來計算我的函數(也許可以對其進行優化,但這不是主題)。

碰巧的是,經過1.5個小時的計算,我的do函數給出了一個錯誤,因為我忘記了考慮代碼中的一種特定情況。 唯一的問題是我不知道哪個迭代會產生此錯誤,並且通常來說,我必須創建循環來替換我的group_by / summary / do代碼,才能知道導致問題的數據是什么。

一個非常簡單的示例來解釋我的問題...原因通常,我使用一些復雜的,自己創建的功能,這些功能帶有很多組。

require(dplyr)

FUN <- function(x) {
  for (i in 1:which(!is.na(x$value))[1])
  {
    print("TEST")
  }
}

df <- data.frame(ID = c(rep("ID1",10), rep("ID2", 20), rep("ID3", 5)),
                 value= c(sample(1:100, 10), rep(NA, 20), sample(0:50, 5)))

Result <- group_by(df, ID) %>%
  do(Res=FUN(.))

在這里,第二個組(ID2組)將出現錯誤,因為所有值均為NA,FUN中的循環無法工作。 要知道我的問題來自ID2,我將執行以下操作:

for (j in 1:length(unique(df$ID)))
{
  Interm <- df[df$ID==unique(df$ID)[j],]
  Res <- FUN(Interm)
  print(j)
}

由於這個原因,我知道我的問題來自j = 2,所以來自ID2。

這樣的簡單計算就可以了,但是對於我的函數來說確實需要很長時間。 例如,正確地知道我的group_by / do代碼在45分鍾后給出了錯誤,我做了兩個循環的代碼來知道給出錯誤的數據是什么,並且1.5小時后,它仍在運行...當我將找到錯誤后,我將僅在我的函數(FUN)中添加一行以考慮此特定情況,重新運行do代碼,然后在1h之后可能會出現另一個錯誤...

一個簡單的問題:有沒有辦法知道代碼從group_by / do代碼中給出了錯誤?

謝謝

Frank的答案到目前為止是最簡單的,但這是我為中端管道調試等工作的代碼樣本。

買者自負:

  • 該代碼未經測試;
  • 即使經過充分測試,也無意將其用於生產或無人值守的使用;
  • dplyr和相關軟件包的任何作者或貢獻者都沒有祝福它,甚至沒有對其進行評論;
  • 它當前可以在R-3.4和dplyr-0.7.4 ,但是並沒有利用許多應該使用的“優點”,例如rlang和/或lazyeval
  • 它適合我使用,未經您測試。

如果/發現任何迷惑之處,歡迎提交錯誤報告。

中管消息

這可能包括您想要的任何東西:

mtcars %>%
  group_by(cyl) %>%
  pipe_message(whichcyl = cyl[1], bestmpg = max(mpg)) %>%
  summarize(mpg=mean(mpg))
# Mid-pipe message (2018-05-01 09:39:26):
#  $ :List of 2
#   ..$ whichcyl: num 4
#   ..$ bestmpg : num 33.9
#  $ :List of 2
#   ..$ whichcyl: num 6
#   ..$ bestmpg : num 21.4
#  $ :List of 2
#   ..$ whichcyl: num 8
#   ..$ bestmpg : num 19.2
# # A tibble: 3 x 2
#     cyl   mpg
#   <dbl> <dbl>
# 1    4.  26.7
# 2    6.  19.7
# 3    8.  15.1

中管斷言

您可以選擇只是了解正在發生的事情並快速查看數據,從而讓您看到時機,然后退出管道:

mtcars %>%
  group_by(cyl) %>%
  pipe_assert(all(mpg > 12), .debug=TRUE) %>%
  summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]> 
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
#     stop(.msg, call. = FALSE)
# } else {
#     .x[.indices[[.ind]], ] <- x
#     return(.x)
# }
# Browse[2]> 
x
# # A tibble: 14 x 11
# # Groups:   cyl [1]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  18.7    8.  360.  175.  3.15  3.44  17.0    0.    0.    3.    2.
#  2  14.3    8.  360.  245.  3.21  3.57  15.8    0.    0.    3.    4.
#  3  16.4    8.  276.  180.  3.07  4.07  17.4    0.    0.    3.    3.
#  4  17.3    8.  276.  180.  3.07  3.73  17.6    0.    0.    3.    3.
#  5  15.2    8.  276.  180.  3.07  3.78  18.0    0.    0.    3.    3.
#  6  10.4    8.  472.  205.  2.93  5.25  18.0    0.    0.    3.    4.
#  7  10.4    8.  460.  215.  3.00  5.42  17.8    0.    0.    3.    4.
#  8  14.7    8.  440.  230.  3.23  5.34  17.4    0.    0.    3.    4.
#  9  15.5    8.  318.  150.  2.76  3.52  16.9    0.    0.    3.    2.
# 10  15.2    8.  304.  150.  3.15  3.44  17.3    0.    0.    3.    2.
# 11  13.3    8.  350.  245.  3.73  3.84  15.4    0.    0.    3.    4.
# 12  19.2    8.  400.  175.  3.08  3.84  17.0    0.    0.    3.    2.
# 13  15.8    8.  351.  264.  4.22  3.17  14.5    0.    1.    5.    4.
# 14  15.0    8.  301.  335.  3.54  3.57  14.6    0.    1.    5.    8.
# Browse[2]> 
c
# Error: all(mpg > 12) is not TRUE ... in Group: cyl:8

或者您可以選擇更新/更改數據; 意識到這會修改管道中的數據,而不是源,所以實際上僅在開發和/或一次性修復中好:

mtcars %>%
  group_by(cyl) %>%
  pipe_assert(all(mpg > 12), .debug=TRUE) %>%
  summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]> 
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
#     stop(.msg, call. = FALSE)
# } else {
#     .x[.indices[[.ind]], ] <- x
#     return(.x)
# }

(忽略當前的調試代碼行, if ... ,那是我的工作,並不漂亮。)我現在在調試器中,我可以查看和更改/修復數據:

# Browse[2]> 
x
# ...as before...
x$mpg <- x$mpg + 1000

如果數據被更改,管道將繼續,否則它將stop

# Browse[2]> 
c
# # A tibble: 3 x 2
#     cyl    mpg
#   <dbl>  <dbl>
# 1    4.   26.7
# 2    6.   19.7
# 3    8. 1015. 

(可以更改數據,但不能更改標簽...因此,如果我們完成x$cyl <- 99 ,它在其余管道中仍將顯示8這是dplyr不允許您更改分組變量的結果...這是一件好事,IMO。)

還有pipe_debug總是調試,但效果pipe_debug 它也不會(當前)傳遞更改后的數據,因此請使用pipe_assert (例如, pipe_assert(FALSE,.debug=TRUE) )。


來源,也可以在我的要旨中找到

#' Mid-pipe assertions
#'
#' Test assertions mid-pipe. Each assertion is executed individually
#' on each group (if present) of the piped data. Any failures indicate
#' the group that caused the fail, terminating on the first failure.
#'
#' If `.debug`, then the interpreter enters the `browser()`, allowing
#' you to look at the specific data, stored as `x` (just the grouped
#' data if `is.grouped_df(.x)`, all data otherwise). If the data is
#' changed, then the altered data will be sent forward in the pipeline
#' (assuming you fixed the failed assertion), otherwise the assertion
#' will fail (as an assertion should).
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed expression(s), each must evaluate to a single
#'   'logical'; similar to [assertthat::assert_that()], rather than
#'   combining expressions with `&&`, separate them by commas so that
#'   better error messages can be generated.
#' @param .msg a custom error message to be printed if one of the
#'   conditions is false.
#' @param .debug logical, whether to invoke [browser()] if the
#'   assertion fails; if `TRUE`, then when the debugger begins on a
#'   fail, the grouped data will be in the variable `x`
#' @return data.frame (unchanged)
#' @export
#' @import assertthat
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#' library(assertthat)
#'
#' mtcars %>%
#'   group_by(cyl) %>%
#'   pipe_assert(
#'     all(cyl < 9),
#'     all(mpg > 10)
#'   ) %>%
#'   count()
#' # # A tibble: 3 x 2
#' #     cyl     n
#' #   <dbl> <int>
#' # 1     4    11
#' # 2     6     7
#' # 3     8    14
#' 
#' # note here that the "4" group is processed first and does not fail
#' mtcars %>%
#'   group_by(cyl, vs) %>%
#'   pipe_assert( all(cyl < 6) ) %>%
#'   count()
#' # Error: all(cyl < 6) is not TRUE ... in Group: cyl:6, vs:0
#'
#' }
pipe_assert <- function(.x, ..., .msg = NULL, .debug = FALSE) {
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
  }
  for (assertion in eval(substitute(alist(...)))) {
    for (.ind in seq_along(.indices)) {
      .out <- assertthat::see_if(eval(assertion, .x[.indices[[.ind]],]))
      if (! .out) {
        x <- .x[.indices[[.ind]],]
        if (is.null(.msg)) .msg <- paste(deparse(assertion), "is not TRUE")
        if (is.grouped_df(.x)) {
          .msg <- paste(.msg,
                        paste("in Group:",
                              paste(sprintf("%s:%s", names(.labels),
                                            sapply(.labels, function(z) as.character(z[.ind]))),
                                    collapse = ", ")),
                        sep = " ... ")
        }
        if (.debug) {
          message("#\n", paste("#", .msg), "\n# 'x' is the current data that failed the assertion.\n#\n")
          browser()
        }
        if (identical(x, .x[.indices[[.ind]],])) {
          stop(.msg, call. = FALSE)
        } else {
          .x[.indices[[.ind]],] <- x
          return(.x)
        }
      }
    }
  }
  .x # "unmodified"
}

#' Mid-pipe debugging
#'
#' Mid-pipe peek at the data, named `x` within [browser()], but
#' *changes are not preserved*.
#'
#' @param .x data.frame, potentially grouped
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#'   group_by(cyl, vs) %>%
#'   pipe_debug() %>%
#'   count()
#'
#' }
pipe_debug <- function(.x) {
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
  }
  # I used 'lapply' here instead of a 'for' loop because
  # browser-stepping after 'browser()' in a 'for' loop could continue
  # through all of *this* code, not really meaningful; in pipe_assert
  # above, since the next call after 'browser()' is 'stop()', there's
  # little risk of stepping in or out of this not-meaningful code
  .ign <- lapply(seq_along(.indices), function(.ind, .x) {
    x <- .x[.indices[[.ind]],]
    message("#",
            if (is.grouped_df(.x)) {
              paste("\n# in Group:",
                    paste(sprintf("%s:%s", names(.labels),
                                  sapply(.labels, function(z) as.character(z[.ind]))),
                          collapse = ", "),
                    "\n")
            },
            "# 'x' is the current data (grouped, if appropriate).\n#\n")
    browser()
    NULL
  }, .x = .x)
  .x # "unmodified"
}

#' Mid-pipe status messaging.
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed or named expression(s) whose outputs will be
#'   captured, aggregated with [utils::str()], and displayed as a
#'   [base::message()]; if present, a '.' literal is replace with a
#'   reference to the `data.frame` (in its entirety, not grouped)
#' @param .FUN function, typically [message()] or [warning()] (for
#'   when messages are suppressed); note: if set to `warning`, the
#'   argument `call.=FALSE` is appended to the arguments
#' @param .timestamp logical, if 'TRUE' then a POSIXct timestamp is
#'   appended to the header of the `str`-like output (default 'TRUE')
#' @param .stropts optional list of options to pass to [utils::str()],
#'   for example `list(max.level=1)`
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#'   pipe_message(           # unnamed
#'     "starting",
#'     group_size(.)
#'   ) %>%
#'   group_by(cyl) %>%
#'   pipe_message(           # named
#'     msg  = "grouped",
#'     grps = group_size(.)
#'   ) %>%
#'   count() %>%
#'   ungroup() %>%
#'   pipe_message(           # alternate function, for emphasis!
#'     msg = "done",
#'     .FUN = warning
#'   )
#'
#' head(mtcars) %>%
#'   pipe_message(
#'     list(a = list(aa=1, bb=2, cc=3))
#'   )
#' head(mtcars) %>%
#'   pipe_message(
#'     list(a = list(aa=1, bb=2, cc=3)),
#'     .stropts = list(max.level = 2)
#'   )
#'
#' }
pipe_message <- function(.x, ..., .FUN = message, .timestamp = TRUE, .stropts = NULL) {
  .expressions <- eval(substitute(alist(...)))
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
    .labels <- ""
  }
  lst <- mapply(function(.ind, .lbl) {
    .x <- .x[.ind,,drop=FALSE]
    lapply(.expressions, function(.expr) {
      if (is.call(.expr)) .expr <- as.call(lapply(.expr, function(a) if (a == ".") as.symbol(".x") else a))
      eval(.expr, .x)
    })
  }, .indices, .labels, SIMPLIFY=FALSE)
  .out <- capture.output(
    do.call("str", c(list(lst), .stropts))
  )
  .out[1] <- sprintf("Mid-pipe message%s:",
                     if (.timestamp) paste(" (", Sys.time(), ")", sep = ""))
  do.call(.FUN, c(list(paste(.out, collapse = "\n")),
                  if (identical(.FUN, warning)) list(call. = FALSE)))
  .x # "unmodified"
}

您仍然可以在此處進行打印:

df %>% group_by(ID) %>% do({
  the_id = unique(.$ID)
  cat("Working on...", the_id, "which is...", match(the_id, unique(df$ID)), "/", n_distinct(df$ID), "\n")
  FUN(.)
})

哪個打印

Working on... 1 which is... 1 / 3 
[1] "TEST"
Working on... 2 which is... 2 / 3 
Error in 1:which(!is.na(x$value))[1] : NA/NaN argument

我通常這樣做(使用data.table而不是dplyr,但使用相同的想法)。 我意識到有更復雜的調試方法,但是對我來說它已經足夠好了。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM