Passing a user-defined function to "dplyr::summarize()" when "data" is an argument of the user-defined function

Question

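I am trying to calculate a forestry biometric called top height for a data set that contains several stands, each with many plots. This biometric requires finding the largest-diameter trees that represent 40 trees per acre in a plot or stand, calculating the cumulative trees per acre they represent and their cumulative height, and then dividing the cumulative height by the cumulative trees per acre. This requires a user-defined function, which I have created. My function takes five arguments: data - a data.frame of tree biometric data, dbh - the column holding the diameter at breast height of individual trees, ht - the column holding the height of individual trees, tpa - the trees per acre that an individual tree represents, and n - the number of trees per acre to consider in the calculation, 40 by default (the standard value in forest biometrics for imperial units). As part of my user-defined function, I need to sort the trees in the plot or stand in decreasing order of dbh. I am trying to run this function over each plot and stand combination using dplyr::group_by() %>% summarize(). However, when I use the "pipe" method to pass the data from group_by() to summarize(), the data is not passed. R throws the following error: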

Error in `summarize()`:
! Problem while computing `TOP_HT = topht(dbh = dbh, ht = ht, tpa =
  tpa, n = 40)`.
ℹ The error occurred in group 1: groups = "A".
Caused by error:
! argument "data" is missing, with no default
Run `rlang::last_error()` to see where the error occurred.

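The obvious answer would be to simply take out the data argument and define the function on the tree biometric arguments only. However, this will not work, because I need to sort all of the variables in decreasing order of dbh. Is there a way to pass the grouped data to the data argument within the summarize() call? Below is my reproducible example using fake data: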

##Loading Necessary Package##
library(dplyr)

##Setting Random Number Seed for Reproducibility##
set.seed(55)

##Generating Some Fake Data## 
groups<-c(rep("A", 5), rep("B", 5))
ht<-rnorm(10, 125, 20)
tpa<-rnorm(10, 150, 60)
dbh<-rnorm(10, 20, 2)
DF<-data.frame(groups=groups, dbh=dbh, ht=ht, tpa=tpa)

##Defining the topht function##
topht<-function(data, dbh=NULL, ht=NULL, tpa=NULL, n=40){ #function parameters
  
  ##evaluate function parameters in the data environment
  tmp<-eval(substitute(dbh), envir = data)
  odata<-data[base::order(tmp, decreasing=TRUE),]
  ht<-eval(substitute(ht), envir=odata)
  tpa<-eval(substitute(tpa), envir=odata)
  
  #creating variables for cumulative trees per acre and cumulative height calculations#
  cumtpa<-0
  cumht<-0
  
  #beginning a loop to calculate top height#
  for(i in 1:nrow(odata)){#setting looping range
    if(cumtpa < n){ #only run cumulative adding when cumulative trees per acre is less than n
      cumtpa<-tpa[i]+cumtpa
      cumht<-(ht[i]*tpa[i])+cumht
    }#Close conditional
    if(cumtpa==n){#End the loop if cumulative tpa = n
      break
    }#End Conditional
    if(cumtpa > n){#Adjust final tree's weight when cumulative tpa exceeds n and end loop
      delta <- cumtpa - n
      cumtpa<-cumtpa-delta
      cumht<-cumht-(delta*ht[i])
      break
    }#End Conditional
    if(cumtpa>0){#Define calculation of top height when trees per acre > 0
      topht<-cumht/cumtpa
    }else{#Define complement of conditional
      topht<-0
    }#Close conditional
  }#Close loop
  return(topht)#Output top height
}#Close function

##Attempting to run top height function independently for groups A and B##
out<-as.data.frame(DF %>% group_by(groups) %>% summarize(TOP_HT=topht(dbh=dbh,ht=ht,tpa=tpa,n=40)))#Throws error

Answer 1

I tried to fix your function and apply it to your data:

library(dplyr)

topht <- function(data, dbh = NULL, ht = NULL, tpa = NULL, n = 40){ 
  
  ## pull the dbh column, sort the data by decreasing dbh,
  ## then pull ht and tpa from the sorted data so all three stay aligned
  tmp <- data %>% pull({{ dbh }})
  odata <- data[base::order(tmp, decreasing=TRUE),]
  ht <- odata %>% pull({{ ht }})
  tpa <- odata %>% pull({{ tpa }})
  
  #creating variables for cumulative trees per acre and cumulative height calculations#
  cumtpa <- 0
  cumht <- 0
  outcome <- 0
  
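  for(i in seq_len(nrow(odata))) {
    
    # trees per acre still needed to reach n
    remaining <- n - cumtpa
    
    if(tpa[i] < remaining) {
      
      # the whole tree record fits below n
      cumtpa <- cumtpa + tpa[i]
      cumht <- cumht + (ht[i] * tpa[i])
      
    } else if(tpa[i] == remaining) {
      
      # this tree record lands exactly on n
      cumtpa <- n
      cumht <- cumht + (ht[i] * tpa[i])
      break
      
    } else {
      
      # count only the share of this tree record needed to reach n
      cumtpa <- n
      cumht <- cumht + (ht[i] * remaining)
      break
      
    }
    
  }
  
  # compute top height once, after the loop, so the adjusted totals are used
  if(cumtpa > 0) {
    
    outcome <- cumht / cumtpa
    
  } else {
    
    outcome <- 0
    
  }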
  
  outcome
}

Now we apply this function to each group:

DF %>% 
  group_by(groups) %>% 
  group_modify(~ .x %>% summarize(TOP_HT = topht(., dbh = dbh, ht = ht, tpa = tpa, n = 40))) %>% 
  ungroup() %>% 
  as.data.frame()

We want to apply topht to each group, so we use group_modify() (it is like the little sister of purrr). This returns

  groups    TOP_HT
1      A  88.75246
2      B 123.01531
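To stay inside a single summarize() call, as the question asks, the current group's rows can also be handed to the data argument with pick(everything()). The sketch below assumes dplyr 1.1.0 or later (older versions offer cur_data() for the same purpose). The helper tallest_by_dbh() and the toy data are hypothetical and stand in for topht() only to keep the example short.

```r
library(dplyr)

# Hypothetical helper (not from the original post): returns the height of the
# largest-diameter tree in `data`. Columns are passed unquoted and resolved
# with the curly-curly operator, as in the answer's topht().
tallest_by_dbh <- function(data, dbh, ht) {
  d <- data %>% pull({{ dbh }})
  h <- data %>% pull({{ ht }})
  h[which.max(d)]
}

# Made-up toy data: two groups with two trees each
toy <- data.frame(
  groups = c("A", "A", "B", "B"),
  dbh    = c(10, 20, 30, 25),
  ht     = c(50, 80, 90, 70)
)

out <- toy %>%
  group_by(groups) %>%
  # pick(everything()) is the data frame of the current group's rows
  summarize(HT = tallest_by_dbh(pick(everything()), dbh = dbh, ht = ht)) %>%
  as.data.frame()

out$HT  # 80 for group A, 90 for group B
```

The same call shape should carry over to topht(), for example summarize(TOP_HT = topht(pick(everything()), dbh = dbh, ht = ht, tpa = tpa, n = 40)).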

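A few words of explanation:

Because your function is named topht, you really shouldn't use topht as a variable name (even inside the function). I changed it to outcome.
outcome should be defined/initialized with some value. I chose 0; NA or another value would also work.
The return() at the end of the function is unnecessary. Just use the variable name.
To evaluate the function's arguments (like dbh = dbh), you need the curly-curly operator. For reference: https://www.r-bloggers.com/2019/06/curly-curly-the-successor-of-bang-bang/
Your first if structure should be packed into an if / else if / else structure.
To improve readability, you could use some spacing (see http://adv-r.had.co.nz/Style.html).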
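The question rules out a vectors-only function because every variable has to be sorted by dbh. One way around that, sketched here under assumptions, is to compute a single ordering with order() and apply it to each vector, so no data argument is needed and the function can be called directly inside summarize(). The name topht_vec and the toy numbers are hypothetical, and the sketch assumes tpa is never negative.

```r
library(dplyr)

# Hypothetical vector-based variant (not from the original post).
# dbh, ht and tpa are equal-length numeric vectors for one plot or stand.
topht_vec <- function(dbh, ht, tpa, n = 40) {
  idx <- order(dbh, decreasing = TRUE)  # one ordering, reused for every vector
  ht  <- ht[idx]
  tpa <- tpa[idx]

  before <- cumsum(tpa) - tpa             # cumulative tpa ahead of each record
  used   <- pmin(tpa, pmax(n - before, 0)) # share of each record counted toward n

  if (sum(used) > 0) sum(used * ht) / sum(used) else 0
}

# Made-up toy plot: sorted by dbh the records contribute 25, 10 and 5 trees
# per acre, so (25 * 100 + 10 * 80 + 5 * 60) / 40 = 90
toy <- data.frame(
  groups = c("A", "A", "A"),
  dbh    = c(10, 30, 20),
  ht     = c(60, 100, 80),
  tpa    = c(30, 25, 10)
)

res <- toy %>%
  group_by(groups) %>%
  summarize(TOP_HT = topht_vec(dbh, ht, tpa)) %>%
  as.data.frame()

res$TOP_HT  # 90
```

Capping each record's contribution with pmin() and pmax() plays the role of the loop's adjustment to the final tree's weight, without tracking a break condition.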
