Custom function inside summarize() after group_by() returns the same result for every group

Question

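I have some clustered data and am trying to find the number of points and the area of each cluster. I wrote a function that calculates the area of the convex hull of all the points in a cluster. However, when I pass it through dataframe %>% group_by() %>% summarize(), it does not work properly. Instead of calculating the area of each cluster, it calculates the area as if all the points in the dataset belonged to a single cluster, and returns that as the area of every cluster.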

Example dataset

x <- c(0, 0, 5, 5, 8, 8, 10, 10 )
y <- c(0, 5, 5, 0, 0, 5, 5, 0)
cluster <- c(1, 1, 1, 1, 2, 2, 2, 2)
dat.clustered <- data.frame(x, y, cluster)

## 4 points in each cluster

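Cluster 1 is a 5x5 square with an area of 25, and cluster 2 is a 2x5 rectangle with an area of 10. If all the points in the dataset are treated as a single cluster, they form a 10x5 rectangle with an area of 50 (this matters later).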

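Optionally, add a few points inside each cluster. This is not necessary, but it makes the data look more like a clustering scenario and shows that n() works correctly inside summarize().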

dat.clustered[nrow(dat.clustered) + 1,] <- c(2, 3, 1) # inside cluster 1
dat.clustered[nrow(dat.clustered) + 1,] <- c(4, 3, 1) # inside cluster 1
dat.clustered[nrow(dat.clustered) + 1,] <- c(9, 1, 2) # inside cluster 2

## there are now a total of 6 points in cluster 1 and 5 points in cluster 2

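I wrote a function to calculate the area of the convex hull of the points in a cluster. The function Polygon() comes from the sp package. The function is based on code I found here.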

area.calc <- function(data){
  area = chull(data) %>%
    c(., .[1]) %>%
    data[.,] %>%
    .[, 1:2] %>%
    sp::Polygon(., hole = F) %>%
    .@area
  return(area)
}

Demonstrating that the function calculates the area of each cluster correctly.

area.calc(filter(dat.clustered, cluster == 1)) # 25
area.calc(filter(dat.clustered, cluster == 2)) # 10
area.calc(dat.clustered) # 50 # entire dataset as a single cluster

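It works fine when the data from each cluster is handled separately (or when the whole dataset is treated as a single cluster), but it gives the wrong result when I use it in group_by() %>% summarize().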

clust.smy <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = area.calc(data = .))

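This gives the following result. Note that it shows both clusters with the same area, and since the reported area is 50, my function is evidently returning the area of the whole dataset treated as one big cluster. The number of points in each cluster, from n(), is correct.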

# A tibble: 2 × 3
  cluster count  area
    <dbl> <int> <dbl>
1       1     6    50
2       2     5    50

The result should be

cluster	count	area
1	6	25
2	5	10

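I found a couple of similar questions here and here, but the problem in those was caused by referencing df$variable inside summarize(), which is not what I am doing. That said, the fact that my function returns the area of all points treated as a single cluster makes me think something similar may be going on. That similar problem was solved by using cur_data() inside summarize(), but when I try that instead of the . placeholder, I get the following error.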

clust.smy <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = area.calc(data = cur_data()))

Error

Error in `summarize()`:
! Problem while computing `area = area.calc(data = cur_data())`.
ℹ The error occurred in group 1: cluster = 1.
Caused by error:
! error in evaluating the argument 'obj' in selecting a method for function 'coordinates': Can't subset elements past the end.
ℹ Locations 4, 2, 3, and 4 don't exist.
ℹ There is only 1 element.
Run `rlang::last_error()` to see where the error occurred.

Alternatively, I can try using just a chain of pipes inside summarize(), without my area.calc() function at all.

clust.smy2 <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = (.) %>%
              chull() %>%
              c(., .[1]) %>%
              dat.clustered[.,] %>%
              .[, 1:2] %>%
              sp::Polygon(., hole = F) %>%
              .@area)

This produces the same incorrect result as above.

# A tibble: 2 × 3
  cluster count  area
    <dbl> <int> <dbl>
1       1     6    50
2       2     5    50

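Interestingly, if I pass cur_data() (instead of (.)) into this pipeline, the computed areas are again identical, but this time both are the correct area for cluster 1.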

clust.smy2 <- dat.clustered %>%
  group_by(cluster) %>%
  summarize(count = n(),
            area = cur_data() %>%
              chull() %>%
              c(., .[1]) %>%
              dat.clustered[.,] %>%
              .[, 1:2] %>%
              sp::Polygon(., hole = F) %>%
              .@area)

> clust.smy2
# A tibble: 2 × 3
  cluster count  area
    <dbl> <int> <dbl>
1       1     6    25
2       2     5    25

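Clearly, something interesting is going on with how the grouping is passed to summarize(). I have a feeling that I am somehow misusing the . placeholder, but I am fairly new to dplyr and to writing functions, and I have not been able to work out how, or to find and adapt a solution. Any help is appreciated!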

Session info:

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.9 plyr_1.8.7 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3     rstudioapi_0.13  magrittr_2.0.3   tidyselect_1.1.2 munsell_0.5.0    lattice_0.20-45  colorspace_2.0-3 R6_2.5.1        
 [9] rlang_1.0.6      factoextra_1.0.7 fansi_1.0.3      tools_4.2.0      grid_4.2.0       gtable_0.3.0     utf8_1.2.2       cli_3.4.1       
[17] DBI_1.1.3        ellipsis_0.3.2   assertthat_0.2.1 tibble_3.1.7     lifecycle_1.0.3  crayon_1.5.1     purrr_0.3.4      ggplot2_3.4.0   
[25] vctrs_0.5.1      ggrepel_0.9.2    glue_1.6.2       sp_1.5-1         compiler_4.2.0   pillar_1.7.0     generics_0.1.2   scales_1.2.0    
[33] pkgconfig_2.0.3

Answer 1

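One approach is to use tidyr::nest to split up your data frame instead of grouping it. Your function takes the whole data frame every time, so splitting keeps the right data in the right place: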

library(tidyverse)

dat.clustered %>%
  nest(data = -cluster) %>%
  summarise(
    cluster = cluster,
    n = map_int(data, nrow),
    area = map_dbl(data, area.calc)
  ) 
#> # A tibble: 2 × 3
#>   cluster     n  area
#>     <dbl> <int> <dbl>
#> 1       1     6    25
#> 2       2     5    10
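As for why the original attempt returns 50 for both groups, a small sketch may help. It uses a hypothetical toy data frame (`toy`, with columns `g` and `v`, none of which come from the question) and assumes the magrittr pipe `%>%` as re-exported by dplyr. Inside summarize(), the `.` placeholder is still bound to whatever was piped in, which is the whole grouped data frame, whereas n() is evaluated once per group:

```r
library(dplyr)

# Hypothetical toy data: three rows in group "a", one row in group "b".
toy <- data.frame(g = c("a", "a", "a", "b"), v = 1:4)

res <- toy %>%
  group_by(g) %>%
  summarize(
    rows_in_dot   = nrow(.),  # `.` is the entire piped-in data frame
    rows_in_group = n()       # n() counts rows in the current group only
  )

res
```

Here `rows_in_dot` is the same for both groups, because `.` never changes between groups, while `rows_in_group` differs. The same reasoning seems to account for the 25/25 output in the question: cur_data() supplies only the rows of the current group, but the hull indices computed from it are then used to subset the full dat.clustered, so the local indices for cluster 2 pick out the corner rows of cluster 1.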

Another option is to change your area.calc function to accept the x and y vectors separately:

area.calc <- function(x, y){
  data  <-  data.frame(x, y)
  area  <-  chull(x, y) %>%
    c(., .[1]) %>%
    data[.,] %>%
    .[, 1:2] %>%
    sp::Polygon(., hole = F) %>%
    .@area
  return(area)
}

dat.clustered %>%
  group_by(cluster) %>%
  summarise(n = n(),
         area = area.calc(x, y))
#> # A tibble: 2 × 3
#>   cluster     n  area
#>     <dbl> <int> <dbl>
#> 1       1     6    25
#> 2       2     5    10
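If avoiding the sp dependency is of interest, the same column-wise idea can be sketched with base R only. This is not the method from the question: it swaps sp::Polygon() for the shoelace formula applied to the hull vertices returned by chull(). The helper name `hull_area` and the data frame name `dat` are made up for this illustration; the data are the 11 points from the question.

```r
library(dplyr)

# Hypothetical helper: convex hull area via the shoelace formula.
hull_area <- function(x, y) {
  idx <- chull(x, y)            # indices of the hull vertices, in order
  hx <- x[idx]
  hy <- y[idx]
  nx <- c(hx[-1], hx[1])        # next vertex, wrapping around to the first
  ny <- c(hy[-1], hy[1])
  abs(sum(hx * ny - nx * hy)) / 2
}

# The question's data, rebuilt in one step so the example is self-contained.
dat <- data.frame(
  x       = c(0, 0, 5, 5, 8, 8, 10, 10, 2, 4, 9),
  y       = c(0, 5, 5, 0, 0, 5, 5, 0, 3, 3, 1),
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2)
)

res <- dat %>%
  group_by(cluster) %>%
  summarise(n = n(), area = hull_area(x, y))

res
```

Because `x` and `y` are column names, summarise() hands the function the values of the current group only, so the areas come out as 25 and 10 rather than 50.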
