R: How to vectorize a function that depends on other observations

Question

Hi, I have a dataset as follows:

set.seed(100)
library(microbenchmark)
City=c("City1","City2","City2","City1","City2","City1","City2","City1")
Business=c("B","A","B","A","C","A","E","F")
SomeNumber=c(35,20,15,19,12,40,36,28)
zz=data.frame(City,Business,SomeNumber)
zz_new=do.call("rbind", replicate(1000,zz, simplify = FALSE))
zz_new$BusinessMax=0 #Initializing final variable of interest at 0

I simply replicate the rows of the data frame zz 1000 times so that I can measure performance later.

I also have a custom function, as follows:

City1=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City & !full_data$Business==observation$Business),]
  NewSet2=max(NewSet$SomeNumber)
  return(NewSet2)
}

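What I want to do is apply the custom function only to the rows of zz_new where City == City1. I can create a logical vector i1 that stores whether each row meets that condition, as follows: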

i1 <- zz_new[["City"]] == "City1"

Next, and this is where I need help, I wrote a for loop (which takes a very long time), as follows:

for (i in 1:nrow(zz_new[i1,])){
  zz_new[i1,][i,"BusinessMax"]=City1(full_data=zz_new, observation = zz_new[i1,][i,])
}
zz_new[i1,]

The code above gives the correct answer. However, it is very slow and inefficient. I ran microbenchmark and got:

microbenchmark(
for (i in 1:nrow(zz_new[i1,])){
  zz_new[i1,][i,"BusinessMax"]=City1(full_data=zz_new, observation = zz_new[i1,][i,])
},times = 5)

      min       lq     mean   median       uq     max neval
 4.369269 4.400759 4.433388 4.401734 4.450246 4.54493     5

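How should I go about vectorizing the function City1? In my actual code, I need several condition checks inside City1 (here I used just two columns, City and Business, to subset the data, but I need to include several other variables). Much of the vectorized code on SO uses only information from the given row. Unfortunately, in my case I need to combine information from the given row with information from the rest of the dataset. Any help would be greatly appreciated. Thanks in advance.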

Edit 1:

Explanation of the City1 function:

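First, it creates a subset that keeps only the observations whose City matches the City of the supplied observation. From this subset, it removes the observations whose Business matches the Business of the supplied observation. For example, if the supplied observation has City1 and A as its City and Business, the subset contains only observations with City == City1 and Business not equal to A. The function then returns the maximum SomeNumber in that subset.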
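
As a small worked illustration of that description (a sketch only, not part of the original code), the subsetting can be traced by hand on the eight original rows. The names zz_small, obs, same_city, other_business, and candidates are introduced here purely for illustration; the data values are copied from zz above.

```r
# Same eight rows as zz in the question (names here are illustrative)
zz_small <- data.frame(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28)
)

# Supplied observation: row 4, which is City1 / Business A
obs <- zz_small[4, ]

# Keep rows in the same city, then drop rows with the same business
same_city <- zz_small$City == obs$City
other_business <- zz_small$Business != obs$Business
candidates <- zz_small[same_city & other_business, ]

# Remaining City1 rows are B (35) and F (28), so the maximum is 35
city1_result <- max(candidates$SomeNumber)
```

For a City1 observation with Business B or F, the same steps leave the two A rows in the subset, so the result is 40 instead.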

I also need to create similar functions for the other cities. But if someone can help me vectorize City1, I can try to do the same for the other functions.

Edit 2:

For example, I wrote an alternative function for City == City2, as follows:

City2=function(full_data,observation){
  NewSet=full_data[which(full_data$City==observation$City & full_data$Business==observation$Business),]
  NewSet2=max(NewSet$SomeNumber)-(10*rnorm(1))
  return(NewSet2)
}

In the function above, note that, compared with City1, I removed the "!" symbol from the NewSet condition and subtracted 10*rnorm(1) from the value NewSet2.

Next, I run it only for the observations where City == City2.

i2 <- zz_new[["City"]] == "City2"

for (i in 1:nrow(zz_new[i2,])){
  zz_new[i2,][i,"BusinessMax"]=City2(full_data=zz_new, observation = zz_new[i2,][i,])
}

Answer 1

Here is a fast version that does what your City1() for loop does. It looks like you want to do this for every city, so that is what I did.

library(data.table)
# convert to data table and set key for speed
zzdt = as.data.table(zz_new)
setkey(zzdt, City, Business)

# calculate the max for each business, by city (all cities)
biz_max = zzdt[, .(BusinessMax = max(SomeNumber)), by = .(City, Business)]
# self-join the max values and filter out where the business match
# to get the max of other businesses within the same city
other_biz_max = 
  biz_max[biz_max, on = .(City), allow.cartesian = TRUE][
    Business != i.Business,
    .(BusinessMax = max(i.BusinessMax)),
    by = .(City, Business)
  ]
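# join back to the original data
# (zzdt already carries the zero-initialized BusinessMax column from the
# question, so the joined values should appear as i.BusinessMax)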
result = zzdt[other_biz_max]
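
To make the self-join idea easier to follow, here is a minimal base R sketch of the same steps on the eight original rows. It swaps data.table for aggregate() and merge(), so it is an illustration of the logic rather than the code that was timed. The names zz_small, others, pairs, OtherBusiness, and OtherMax are made up for this sketch.

```r
# Same eight rows as zz in the question (names here are illustrative)
zz_small <- data.frame(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28)
)

# Step 1: one maximum per (City, Business) pair
biz_max <- aggregate(SomeNumber ~ City + Business, data = zz_small, FUN = max)

# Step 2: pair every business with every business in the same city,
# then keep only the pairs where the two businesses differ
others <- setNames(biz_max, c("City", "OtherBusiness", "OtherMax"))
pairs <- merge(biz_max, others, by = "City")
pairs <- pairs[pairs$Business != pairs$OtherBusiness, ]

# Step 3: for each (City, Business), the max over the *other* businesses
other_max <- aggregate(OtherMax ~ City + Business, data = pairs, FUN = max)
names(other_max)[names(other_max) == "OtherMax"] <- "BusinessMax"

# Step 4: join back to the original rows
result <- merge(zz_small, other_max, by = c("City", "Business"))
```

On this data, City1 / A gets 35 and City1 / B and F get 40, which matches what City1() returns for those rows. The intermediate table has one row per pair of businesses within a city, so it stays small as long as the number of distinct businesses per city is modest.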

If we only want to apply this to City == "City1", we can filter in the first step and make the final join a full join; the rest stays the same.

library(data.table)
# convert to data table and set key for speed
zzdt = as.data.table(zz_new)
setkey(zzdt, City, Business)

# calculate the max for each business in City1
biz_max = zzdt[City == "City1", .(BusinessMax = max(SomeNumber)), by = .(City, Business)]
# self-join the max values and filter out where the business match
# to get the max of other businesses within the same city
other_biz_max = 
  biz_max[biz_max, on = .(City), allow.cartesian = TRUE][
    Business != i.Business,
    .(BusinessMax = max(i.BusinessMax)),
    by = .(City, Business)
  ]
# join back to the original data
result = merge(zzdt, other_biz_max, by = c("City", "Business"), all = TRUE)

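On my computer, the data.table method takes 0.03 seconds, versus 10.28 seconds for the method in your question, a speedup of roughly 300x. My timing included the data.table conversion and the key setting, but if you use data.table with that key, the rest of your code may speed up as well.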
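
The City2() variant from Edit 2 looks simpler to vectorize, because it only needs the maximum within the row's own (City, Business) group. The following is a hedged base R sketch on the eight original rows, not part of the timing above. It assumes one independent rnorm draw per City2 row, as in the loop. The names zz_small, key, group_max, and noise are illustrative.

```r
set.seed(100)

# Same eight rows as zz in the question (names here are illustrative)
zz_small <- data.frame(
  City = c("City1", "City2", "City2", "City1", "City2", "City1", "City2", "City1"),
  Business = c("B", "A", "B", "A", "C", "A", "E", "F"),
  SomeNumber = c(35, 20, 15, 19, 12, 40, 36, 28)
)
zz_small$BusinessMax <- 0

# Rows the City2 logic applies to
i2 <- zz_small$City == "City2"

# Per-row maximum of SomeNumber within the same City and Business;
# a single pasted key avoids empty City x Business combinations in ave()
key <- paste(zz_small$City, zz_small$Business, sep = "|")
group_max <- ave(zz_small$SomeNumber, key, FUN = max)

# One random draw per City2 row, subtracted from that row's group maximum
noise <- 10 * rnorm(sum(i2))
zz_small$BusinessMax[i2] <- group_max[i2] - noise
```

Rows for other cities keep their initial value of 0 here, so a City1 step could fill them separately.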
