![](/img/trans.png)
[英]Filter in group_by + mutate not working as in group_by + summarise in dplyr R
[英]Debugging a group_by/summarise/do code in R
我經常使用dplyr軟件包和group_by / summarise / do函數。 我經常擁有大量數據集,並且需要2到3個小時來計算我的函數(也許可以對其進行優化,但這不是主題)。
碰巧的是,經過1.5個小時的計算,我的do函數給出了一個錯誤,因為我忘記了考慮代碼中的一種特定情況。 唯一的問題是我不知道哪個迭代會產生此錯誤,並且通常來說,我必須創建循環來替換我的group_by / summary / do代碼,才能知道導致問題的數據是什么。
一個非常簡單的示例來解釋我的問題...原因通常,我使用一些復雜的,自己創建的功能,這些功能帶有很多組。
require(dplyr)
FUN <- function(x) {
for (i in 1:which(!is.na(x$value))[1])
{
print("TEST")
}
}
df <- data.frame(ID = c(rep("ID1",10), rep("ID2", 20), rep("ID3", 5)),
value= c(sample(1:100, 10), rep(NA, 20), sample(0:50, 5)))
Result <- group_by(df, ID) %>%
do(Res=FUN(.))
在這里,第二個組(ID2組)將出現錯誤,因為所有值均為NA,FUN中的循環無法工作。 要知道我的問題來自ID2,我將執行以下操作:
for (j in 1:length(unique(df$ID)))
{
Interm <- df[df$ID==unique(df$ID)[j],]
Res <- FUN(Interm)
print(j)
}
由於這個原因,我知道我的問題來自j = 2,所以來自ID2。
這樣的簡單計算就可以了,但是對於我的函數來說確實需要很長時間。 例如,正確地知道我的group_by / do代碼在45分鍾后給出了錯誤,我做了兩個循環的代碼來知道給出錯誤的數據是什么,並且1.5小時后,它仍在運行...當我將找到錯誤后,我將僅在我的函數(FUN)中添加一行以考慮此特定情況,重新運行do代碼,然后在1h之后可能會出現另一個錯誤...
一個簡單的問題:有沒有辦法知道代碼從group_by / do代碼中給出了錯誤?
謝謝
Frank的答案到目前為止是最簡單的,但這是我為中端管道調試等工作的代碼樣本。
買者自負:
dplyr
和相關軟件包的任何作者或貢獻者都沒有祝福它,甚至沒有對其進行評論; dplyr-0.7.4
,但是並沒有利用許多應該使用的“優點”,例如rlang
和/或lazyeval
; 如果/發現任何迷惑之處,歡迎提交錯誤報告。
這可能包括您想要的任何東西:
mtcars %>%
group_by(cyl) %>%
pipe_message(whichcyl = cyl[1], bestmpg = max(mpg)) %>%
summarize(mpg=mean(mpg))
# Mid-pipe message (2018-05-01 09:39:26):
# $ :List of 2
# ..$ whichcyl: num 4
# ..$ bestmpg : num 33.9
# $ :List of 2
# ..$ whichcyl: num 6
# ..$ bestmpg : num 21.4
# $ :List of 2
# ..$ whichcyl: num 8
# ..$ bestmpg : num 19.2
# # A tibble: 3 x 2
# cyl mpg
# <dbl> <dbl>
# 1 4. 26.7
# 2 6. 19.7
# 3 8. 15.1
您可以選擇只是了解正在發生的事情並快速查看數據,從而讓您看到時機,然后退出管道:
mtcars %>%
group_by(cyl) %>%
pipe_assert(all(mpg > 12), .debug=TRUE) %>%
summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]>
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
# stop(.msg, call. = FALSE)
# } else {
# .x[.indices[[.ind]], ] <- x
# return(.x)
# }
# Browse[2]>
x
# # A tibble: 14 x 11
# # Groups: cyl [1]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 18.7 8. 360. 175. 3.15 3.44 17.0 0. 0. 3. 2.
# 2 14.3 8. 360. 245. 3.21 3.57 15.8 0. 0. 3. 4.
# 3 16.4 8. 276. 180. 3.07 4.07 17.4 0. 0. 3. 3.
# 4 17.3 8. 276. 180. 3.07 3.73 17.6 0. 0. 3. 3.
# 5 15.2 8. 276. 180. 3.07 3.78 18.0 0. 0. 3. 3.
# 6 10.4 8. 472. 205. 2.93 5.25 18.0 0. 0. 3. 4.
# 7 10.4 8. 460. 215. 3.00 5.42 17.8 0. 0. 3. 4.
# 8 14.7 8. 440. 230. 3.23 5.34 17.4 0. 0. 3. 4.
# 9 15.5 8. 318. 150. 2.76 3.52 16.9 0. 0. 3. 2.
# 10 15.2 8. 304. 150. 3.15 3.44 17.3 0. 0. 3. 2.
# 11 13.3 8. 350. 245. 3.73 3.84 15.4 0. 0. 3. 4.
# 12 19.2 8. 400. 175. 3.08 3.84 17.0 0. 0. 3. 2.
# 13 15.8 8. 351. 264. 4.22 3.17 14.5 0. 1. 5. 4.
# 14 15.0 8. 301. 335. 3.54 3.57 14.6 0. 1. 5. 8.
# Browse[2]>
c
# Error: all(mpg > 12) is not TRUE ... in Group: cyl:8
或者您可以選擇更新/更改數據; 意識到這會修改管道中的數據,而不是源,所以實際上僅在開發和/或一次性修復中好:
mtcars %>%
group_by(cyl) %>%
pipe_assert(all(mpg > 12), .debug=TRUE) %>%
summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]>
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
# stop(.msg, call. = FALSE)
# } else {
# .x[.indices[[.ind]], ] <- x
# return(.x)
# }
(忽略當前的調試代碼行, if ...
,那是我的工作,並不漂亮。)我現在在調試器中,我可以查看和更改/修復數據:
# Browse[2]>
x
# ...as before...
x$mpg <- x$mpg + 1000
如果數據被更改,管道將繼續,否則它將stop
。
# Browse[2]>
c
# # A tibble: 3 x 2
# cyl mpg
# <dbl> <dbl>
# 1 4. 26.7
# 2 6. 19.7
# 3 8. 1015.
(可以更改數據,但不能更改標簽...因此,如果我們完成x$cyl <- 99
,它在其余管道中仍將顯示8
這是dplyr
不允許您更改分組變量的結果...這是一件好事,IMO。)
還有pipe_debug
總是調試,但效果pipe_debug
。 它也不會(當前)傳遞更改后的數據,因此請使用pipe_assert
(例如, pipe_assert(FALSE,.debug=TRUE)
)。
來源,也可以在我的要旨中找到 :
#' Mid-pipe assertions
#'
#' Test assertions mid-pipe. Each assertion is executed individually
#' on each group (if present) of the piped data. Any failures indicate
#' the group that caused the fail, terminating on the first failure.
#'
#' If `.debug`, then the interpreter enters the `browser()`, allowing
#' you to look at the specific data, stored as `x` (just the grouped
#' data if `is.grouped_df(.x)`, all data otherwise). If the data is
#' changed, then the altered data will be sent forward in the pipeline
#' (assuming you fixed the failed assertion), otherwise the assertion
#' will fail (as an assertion should).
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed expression(s), each must evaluate to a single
#' 'logical'; similar to [assertthat::assert_that()], rather than
#' combining expressions with `&&`, separate them by commas so that
#' better error messages can be generated.
#' @param .msg a custom error message to be printed if one of the
#' conditions is false.
#' @param .debug logical, whether to invoke [browser()] if the
#' assertion fails; if `TRUE`, then when the debugger begins on a
#' fail, the grouped data will be in the variable `x`
#' @return data.frame (unchanged)
#' @export
#' @import assertthat
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#' library(assertthat)
#'
#' mtcars %>%
#' group_by(cyl) %>%
#' pipe_assert(
#' all(cyl < 9),
#' all(mpg > 10)
#' ) %>%
#' count()
#' # # A tibble: 3 x 2
#' # cyl n
#' # <dbl> <int>
#' # 1 4 11
#' # 2 6 7
#' # 3 8 14
#'
#' # note here that the "4" group is processed first and does not fail
#' mtcars %>%
#' group_by(cyl, vs) %>%
#' pipe_assert( all(cyl < 6) ) %>%
#' count()
#' # Error: all(cyl < 6) is not TRUE ... in Group: cyl:6, vs:0
#'
#' }
pipe_assert <- function(.x, ..., .msg = NULL, .debug = FALSE) {
if (is.grouped_df(.x)) {
.indices <- lapply(attr(.x, "indices"), `+`, 1L)
.labels <- attr(.x, "labels")
} else {
.indices <- list(seq_len(nrow(.x)))
}
for (assertion in eval(substitute(alist(...)))) {
for (.ind in seq_along(.indices)) {
.out <- assertthat::see_if(eval(assertion, .x[.indices[[.ind]],]))
if (! .out) {
x <- .x[.indices[[.ind]],]
if (is.null(.msg)) .msg <- paste(deparse(assertion), "is not TRUE")
if (is.grouped_df(.x)) {
.msg <- paste(.msg,
paste("in Group:",
paste(sprintf("%s:%s", names(.labels),
sapply(.labels, function(z) as.character(z[.ind]))),
collapse = ", ")),
sep = " ... ")
}
if (.debug) {
message("#\n", paste("#", .msg), "\n# 'x' is the current data that failed the assertion.\n#\n")
browser()
}
if (identical(x, .x[.indices[[.ind]],])) {
stop(.msg, call. = FALSE)
} else {
.x[.indices[[.ind]],] <- x
return(.x)
}
}
}
}
.x # "unmodified"
}
#' Mid-pipe debugging
#'
#' Mid-pipe peek at the data, named `x` within [browser()], but
#' *changes are not preserved*.
#'
#' @param .x data.frame, potentially grouped
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#' group_by(cyl, vs) %>%
#' pipe_debug() %>%
#' count()
#'
#' }
pipe_debug <- function(.x) {
if (is.grouped_df(.x)) {
.indices <- lapply(attr(.x, "indices"), `+`, 1L)
.labels <- attr(.x, "labels")
} else {
.indices <- list(seq_len(nrow(.x)))
}
# I used 'lapply' here instead of a 'for' loop because
# browser-stepping after 'browser()' in a 'for' loop could continue
# through all of *this* code, not really meaningful; in pipe_assert
# above, since the next call after 'browser()' is 'stop()', there's
# little risk of stepping in or out of this not-meaningful code
.ign <- lapply(seq_along(.indices), function(.ind, .x) {
x <- .x[.indices[[.ind]],]
message("#",
if (is.grouped_df(.x)) {
paste("\n# in Group:",
paste(sprintf("%s:%s", names(.labels),
sapply(.labels, function(z) as.character(z[.ind]))),
collapse = ", "),
"\n")
},
"# 'x' is the current data (grouped, if appropriate).\n#\n")
browser()
NULL
}, .x = .x)
.x # "unmodified"
}
#' Mid-pipe status messaging.
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed or named expression(s) whose outputs will be
#' captured, aggregated with [utils::str()], and displayed as a
#' [base::message()]; if present, a '.' literal is replace with a
#' reference to the `data.frame` (in its entirety, not grouped)
#' @param .FUN function, typically [message()] or [warning()] (for
#' when messages are suppressed); note: if set to `warning`, the
#' argument `call.=FALSE` is appended to the arguments
#' @param .timestamp logical, if 'TRUE' then a POSIXct timestamp is
#' appended to the header of the `str`-like output (default 'TRUE')
#' @param .stropts optional list of options to pass to [utils::str()],
#' for example `list(max.level=1)`
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#' pipe_message( # unnamed
#' "starting",
#' group_size(.)
#' ) %>%
#' group_by(cyl) %>%
#' pipe_message( # named
#' msg = "grouped",
#' grps = group_size(.)
#' ) %>%
#' count() %>%
#' ungroup() %>%
#' pipe_message( # alternate function, for emphasis!
#' msg = "done",
#' .FUN = warning
#' )
#'
#' head(mtcars) %>%
#' pipe_message(
#' list(a = list(aa=1, bb=2, cc=3))
#' )
#' head(mtcars) %>%
#' pipe_message(
#' list(a = list(aa=1, bb=2, cc=3)),
#' .stropts = list(max.level = 2)
#' )
#'
#' }
pipe_message <- function(.x, ..., .FUN = message, .timestamp = TRUE, .stropts = NULL) {
.expressions <- eval(substitute(alist(...)))
if (is.grouped_df(.x)) {
.indices <- lapply(attr(.x, "indices"), `+`, 1L)
.labels <- attr(.x, "labels")
} else {
.indices <- list(seq_len(nrow(.x)))
.labels <- ""
}
lst <- mapply(function(.ind, .lbl) {
.x <- .x[.ind,,drop=FALSE]
lapply(.expressions, function(.expr) {
if (is.call(.expr)) .expr <- as.call(lapply(.expr, function(a) if (a == ".") as.symbol(".x") else a))
eval(.expr, .x)
})
}, .indices, .labels, SIMPLIFY=FALSE)
.out <- capture.output(
do.call("str", c(list(lst), .stropts))
)
.out[1] <- sprintf("Mid-pipe message%s:",
if (.timestamp) paste(" (", Sys.time(), ")", sep = ""))
do.call(.FUN, c(list(paste(.out, collapse = "\n")),
if (identical(.FUN, warning)) list(call. = FALSE)))
.x # "unmodified"
}
您仍然可以在此處進行打印:
df %>% group_by(ID) %>% do({
the_id = unique(.$ID)
cat("Working on...", the_id, "which is...", match(the_id, unique(df$ID)), "/", n_distinct(df$ID), "\n")
FUN(.)
})
哪個打印
Working on... 1 which is... 1 / 3
[1] "TEST"
Working on... 2 which is... 2 / 3
Error in 1:which(!is.na(x$value))[1] : NA/NaN argument
我通常這樣做(使用data.table而不是dplyr,但使用相同的想法)。 我意識到有更復雜的調試方法,但是對我來說它已經足夠好了。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.