Sparklyr spark_apply function to run efficiently on equal groups

Question

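How do I run a custom function efficiently, in chunks, in a sparklyr environment?

I have a haversine function that calculates the distance between 2 sets of lat/long coordinates within 1 data frame. As you can imagine, 10 customers against 10 store locations generates 100 rows. I have 10 million customers and 500 stores. That is 5 billion rows. Running a join at this scale and then computing the distance function on the result could strain my Spark environment, if not crash it.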
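To make the scale concrete, the row count can be sketched in a few lines of R. The customer and store counts come from the question; the per-row size is purely an illustrative assumption (five 8-byte double columns, ignoring ids, compression, and Spark overhead), so the storage figure is only a rough order of magnitude.

```r
# Counts taken from the question
n_customers <- 1e7   # 10 million customers
n_stores    <- 500   # 500 stores

# A cross join produces one row per (customer, store) pair
n_pairs <- n_customers * n_stores

# Illustrative assumption: 4 coordinates + 1 distance, each an 8-byte double
bytes_per_row <- 5 * 8
approx_gb <- n_pairs * bytes_per_row / 1e9

format(n_pairs, big.mark = ",", scientific = FALSE)
approx_gb
```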

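My idea is to do the join first, which generates the 5 billion rows, then spark_apply the function to equal-sized chunks separately and append the results back together. To do this locally in R, I would write a for loop, process the chunks one after another, and then append the results. How should I do this in Spark?

Here is what I have so far. I would appreciate help with the syntax for running the spark_apply function based on the equal_groups column, assuming this is the best strategy for running it efficiently.

If this feature is relevant at all, I could not work out how to apply it to my syntax: https://www.rstudio.com/blog/sparklyr-1-2/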

library(tidyverse)

#sample data
df <- tibble(
  place=c("Finland", "Canada", "Tanzania", "Bolivia", "France"),
  longitude=c(27.472918, -90.476303, 34.679950, -65.691146, 4.533465),
  latitude=c(63.293001, 54.239631, -2.855123, -13.795272, 48.603949))

from <- 
  df[1:3,] %>%  #pick first 3 rows
  rename(long1 = longitude,
         lat1 = latitude)
to <- df[4:5,] %>%  #pick last 2 rows
  rename(long2 = longitude,
         lat2 = latitude)

#increase data size
n <- 100
from_many <- 
  do.call("rbind", replicate(n, from, simplify = FALSE)) %>% 
  mutate(place = row_number())


library(sparklyr)
sc <- spark_connect(master = "local")

from_many_sf <- copy_to(sc, from_many,overwrite = TRUE)
to_sf <- copy_to(sc, to,overwrite = TRUE)

# --- haversine distance function ---
get_geodesic_distance = function(x){
  
  geolocation = function(long1, lat1, long2, lat2){
    
    deg2rad <- function(deg) return(deg*pi/180)
    
    # Convert degrees to radians
    long1 <- deg2rad(long1)
    lat1 <- deg2rad(lat1)
    long2 <- deg2rad(long2)
    lat2 <- deg2rad(lat2)
    
    R = 6378137 # Equatorial radius of the earth in meters (the mean radius is about 6371 km)
    
    diff.long = (long2-long1)
    diff.lat = (lat2-lat1)
    
    a =(sin(diff.lat/2) * sin(diff.lat/2) + cos(lat1) * cos(lat2) * sin(diff.long/2)* sin(diff.long/2))
    c= 2*atan2(sqrt(a),sqrt(1-a))
    d = R*c
    
    return(d) # Distance in meters, because R is given in meters
  }
  
  dist_vec = geolocation(x$long1, x$lat1, x$long2, x$lat2)
  
  res = dplyr::mutate(x, distance = dist_vec)
  res
}

#expansive join
full_sf <- 
  from_many_sf %>% 
  full_join(to_sf, by = character())

full_sf %>% tally

# application of distance calculation
full_sf %>% 
  spark_apply(get_geodesic_distance)

#cut data into groups
full_sf %>% 
  sdf_with_sequential_id(., id = "id", from = 1L) %>% 
  mutate(equal_groups = ntile(id, 4)) %>% 
  group_by(equal_groups) %>% 
  tally()

Result of spark_apply

# Source: spark<?> [?? x 7]
   place_x long1  lat1 place_y  long2  lat2  distance
     <int> <dbl> <dbl> <chr>    <dbl> <dbl>     <dbl>
 1       1  27.5 63.3  Bolivia -65.7  -13.8 11545583.
 2       1  27.5 63.3  France    4.53  48.6  2148220.
 3       2 -90.5 54.2  Bolivia -65.7  -13.8  7929331.
 4       2 -90.5 54.2  France    4.53  48.6  6111619.
 5       3  34.7 -2.86 Bolivia -65.7  -13.8 11061340.
 6       3  34.7 -2.86 France    4.53  48.6  6427714.
 7       4  27.5 63.3  Bolivia -65.7  -13.8 11545583.
 8       4  27.5 63.3  France    4.53  48.6  2148220.
 9       5 -90.5 54.2  Bolivia -65.7  -13.8  7929331.
10       5 -90.5 54.2  France    4.53  48.6  6111619.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows

Answer 1

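At least from what I can see above, I would suggest solving this problem purely in Spark. That means:

You do not need to work out your own parallelization, for example with the foreach package; Spark distributes the work to your workers.
You do not need to figure out sparklyr::spark_apply()

"You can run arbitrary R code in each worker node to run any computation... If you are already familiar with R, you might be tempted to use this approach for all Spark operations; however, this is not the recommended use of spark_apply(). Previous chapters provided more efficient techniques and tools to solve well-known problems" (Mastering Spark with R: Chapter 11)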

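Doing all of this in Spark is possible because sparklyr translates dplyr verbs into Spark SQL, and a large number of Hive functions have exactly the same (or almost the same) names as their R counterparts, for example sin, cos, and so on.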
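As a small local check of that point, the haversine formula uses only sin, cos, sqrt, and atan2, so it can be written as one vectorised arithmetic expression. The sketch below does this in base R; `haversine_m` is a hypothetical helper name, not part of the original post. It only runs on local vectors: on a Spark table, the same arithmetic has to be written inline inside mutate(), because a user-defined R function is not translated to SQL.

```r
# Haversine distance in meters, vectorised over all four inputs.
# `haversine_m` is a hypothetical name used only for this sketch.
haversine_m <- function(long1, lat1, long2, lat2, radius = 6378137) {
  to_rad <- pi / 180
  long1 <- long1 * to_rad
  lat1  <- lat1  * to_rad
  long2 <- long2 * to_rad
  lat2  <- lat2  * to_rad

  a <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2

  radius * 2 * atan2(sqrt(a), sqrt(1 - a))
}

# Finland -> France, using the coordinates from the sample data
d <- haversine_m(27.472918, 63.293001, 4.533465, 48.603949)
round(d)
```

The value should agree with the Finland/France rows in the spark_apply output of the question.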

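Essentially, you can let Spark figure out the hard part of distributing the work (inside a for loop that makes smaller chunks of work, if you want one at all) and simply copy and paste the code from the function you defined: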

library(magrittr) ##### if not already imported

full_sf <- full_sf %>% 
  dplyr::mutate(group = dplyr::ntile(place_x, n = 4))

for(grp in 1:4) {
  full_sf %>% 
    dplyr::filter(group == grp) %>%
    dplyr::relocate(place_y) %>% 
    dplyr::mutate(
      dplyr::across(long1:lat2, ~ .x * pi/180), 
      diff.long = long2 - long1, 
      diff.lat = lat2 - lat1, 
      a = (sin(diff.lat/2) * sin(diff.lat/2) + 
        cos(lat1) * cos(lat2) * sin(diff.long/2)* sin(diff.long/2)), 
      c = 2*atan2(sqrt(a),sqrt(1-a))) %>% 
    dplyr::mutate(d = 6378137 * c) %>% 
    print()
}

# Source: spark<?> [?? x 12]
   place_y place_x  long1    lat1   long2   lat2 group diff.long diff.lat      a     c         d
   <chr>     <int>  <dbl>   <dbl>   <dbl>  <dbl> <int>     <dbl>    <dbl>  <dbl> <dbl>     <dbl>
 1 Bolivia       1  0.479  1.10   -1.15   -0.241     1    -1.63   -1.35   0.619  1.81  11545583.
 2 France        1  0.479  1.10    0.0791  0.848     1    -0.400  -0.256  0.0281 0.337  2148220.
 3 Bolivia       2 -1.58   0.947  -1.15   -0.241     1     0.433  -1.19   0.339  1.24   7929331.
 4 France        2 -1.58   0.947   0.0791  0.848     1     1.66   -0.0984 0.213  0.958  6111619.
 5 Bolivia       3  0.605 -0.0498 -1.15   -0.241     1    -1.75   -0.191  0.581  1.73  11061340.
 6 France        3  0.605 -0.0498  0.0791  0.848     1    -0.526   0.898  0.233  1.01   6427714.
 7 Bolivia       4  0.479  1.10   -1.15   -0.241     1    -1.63   -1.35   0.619  1.81  11545583.
 8 France        4  0.479  1.10    0.0791  0.848     1    -0.400  -0.256  0.0281 0.337  2148220.
 9 Bolivia       5 -1.58   0.947  -1.15   -0.241     1     0.433  -1.19   0.339  1.24   7929331.
10 France        5 -1.58   0.947   0.0791  0.848     1     1.66   -0.0984 0.213  0.958  6111619.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
# Source: spark<?> [?? x 12]
   place_y place_x  long1    lat1   long2   lat2 group diff.long diff.lat      a     c         d
   <chr>     <int>  <dbl>   <dbl>   <dbl>  <dbl> <int>     <dbl>    <dbl>  <dbl> <dbl>     <dbl>
 1 France        8 -1.58   0.947   0.0791  0.848     2     1.66   -0.0984 0.213  0.958  6111619.
 2 Bolivia       9  0.605 -0.0498 -1.15   -0.241     2    -1.75   -0.191  0.581  1.73  11061340.
 3 France        9  0.605 -0.0498  0.0791  0.848     2    -0.526   0.898  0.233  1.01   6427714.
 4 Bolivia      10  0.479  1.10   -1.15   -0.241     2    -1.63   -1.35   0.619  1.81  11545583.
 5 France       10  0.479  1.10    0.0791  0.848     2    -0.400  -0.256  0.0281 0.337  2148220.
 6 Bolivia      11 -1.58   0.947  -1.15   -0.241     2     0.433  -1.19   0.339  1.24   7929331.
 7 France       11 -1.58   0.947   0.0791  0.848     2     1.66   -0.0984 0.213  0.958  6111619.
 8 Bolivia      12  0.605 -0.0498 -1.15   -0.241     2    -1.75   -0.191  0.581  1.73  11061340.
 9 France       12  0.605 -0.0498  0.0791  0.848     2    -0.526   0.898  0.233  1.01   6427714.
10 Bolivia      13  0.479  1.10   -1.15   -0.241     2    -1.63   -1.35   0.619  1.81  11545583.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
# Source: spark<?> [?? x 12]
   place_y place_x  long1    lat1   long2   lat2 group diff.long diff.lat      a     c         d
   <chr>     <int>  <dbl>   <dbl>   <dbl>  <dbl> <int>     <dbl>    <dbl>  <dbl> <dbl>     <dbl>
 1 Bolivia      16  0.479  1.10   -1.15   -0.241     3    -1.63   -1.35   0.619  1.81  11545583.
 2 France       16  0.479  1.10    0.0791  0.848     3    -0.400  -0.256  0.0281 0.337  2148220.
 3 Bolivia      17 -1.58   0.947  -1.15   -0.241     3     0.433  -1.19   0.339  1.24   7929331.
 4 France       17 -1.58   0.947   0.0791  0.848     3     1.66   -0.0984 0.213  0.958  6111619.
 5 Bolivia      18  0.605 -0.0498 -1.15   -0.241     3    -1.75   -0.191  0.581  1.73  11061340.
 6 France       18  0.605 -0.0498  0.0791  0.848     3    -0.526   0.898  0.233  1.01   6427714.
 7 Bolivia      19  0.479  1.10   -1.15   -0.241     3    -1.63   -1.35   0.619  1.81  11545583.
 8 France       19  0.479  1.10    0.0791  0.848     3    -0.400  -0.256  0.0281 0.337  2148220.
 9 Bolivia      20 -1.58   0.947  -1.15   -0.241     3     0.433  -1.19   0.339  1.24   7929331.
10 France       20 -1.58   0.947   0.0791  0.848     3     1.66   -0.0984 0.213  0.958  6111619.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
# Source: spark<?> [?? x 12]
   place_y place_x  long1    lat1   long2   lat2 group diff.long diff.lat      a     c         d
   <chr>     <int>  <dbl>   <dbl>   <dbl>  <dbl> <int>     <dbl>    <dbl>  <dbl> <dbl>     <dbl>
 1 France       23 -1.58   0.947   0.0791  0.848     4     1.66   -0.0984 0.213  0.958  6111619.
 2 Bolivia      24  0.605 -0.0498 -1.15   -0.241     4    -1.75   -0.191  0.581  1.73  11061340.
 3 France       24  0.605 -0.0498  0.0791  0.848     4    -0.526   0.898  0.233  1.01   6427714.
 4 Bolivia      25  0.479  1.10   -1.15   -0.241     4    -1.63   -1.35   0.619  1.81  11545583.
 5 France       25  0.479  1.10    0.0791  0.848     4    -0.400  -0.256  0.0281 0.337  2148220.
 6 Bolivia      26 -1.58   0.947  -1.15   -0.241     4     0.433  -1.19   0.339  1.24   7929331.
 7 France       26 -1.58   0.947   0.0791  0.848     4     1.66   -0.0984 0.213  0.958  6111619.
 8 Bolivia      27  0.605 -0.0498 -1.15   -0.241     4    -1.75   -0.191  0.581  1.73  11061340.
 9 France       27  0.605 -0.0498  0.0791  0.848     4    -0.526   0.898  0.233  1.01   6427714.
10 Bolivia      28  0.479  1.10   -1.15   -0.241     4    -1.63   -1.35   0.619  1.81  11545583.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
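The loop above only prints each chunk. If the results need to be kept and "appended back", one possible approach is to write each chunk to the same Parquet location in append mode. This is a minimal sketch under assumptions, not something from the original answer: `write_distance_chunks` and the output path are hypothetical, `tbl` is assumed to be a Spark table like `full_sf` that already has the `group` column, and the call to `sparklyr::spark_write_parquet()` with `mode = "append"` needs a live Spark connection.

```r
# Hypothetical helper: compute distances one chunk at a time and
# append each chunk to a single Parquet dataset at `path`.
write_distance_chunks <- function(tbl, n_groups, path) {
  for (g in seq_len(n_groups)) {
    # Keep only the rows of the current chunk (!! inlines the local value of g)
    chunk <- dplyr::filter(tbl, group == !!g)

    # Same haversine arithmetic as in the loop above, written inline so
    # that it is translated to Spark SQL
    chunk <- dplyr::mutate(
      chunk,
      lat1_r = lat1 * pi / 180,
      lat2_r = lat2 * pi / 180,
      dlat   = lat2_r - lat1_r,
      dlong  = (long2 - long1) * pi / 180,
      a = sin(dlat / 2) * sin(dlat / 2) +
        cos(lat1_r) * cos(lat2_r) * sin(dlong / 2) * sin(dlong / 2),
      distance = 6378137 * 2 * atan2(sqrt(a), sqrt(1 - a))
    )

    # Keep only the identifying columns and the result, then append
    chunk <- dplyr::select(chunk, place_x, place_y, distance)
    sparklyr::spark_write_parquet(chunk, path = path, mode = "append")
  }
  invisible(path)
}

# Usage (requires a Spark connection; the path is an assumption):
# write_distance_chunks(full_sf, n_groups = 4, path = "distances_parquet")
```

Whether chunking is needed at all depends on the cluster; Spark already splits the work into partitions, so the loop mainly limits how much is computed and written in one job.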
