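Sparklyr spark_apply function on equal groups to run efficiently
How can I efficiently run a custom function in chunks in a sparklyr environment?
I have a haversine function that calculates the distance between two sets of lat/long coordinates within one data frame. As you can imagine, 10 customers against 10 store locations generates 100 rows. I have 10 million customers and 500 stores. That is 5 billion rows. Running a join of this size and then computing the distance function on the result is likely to strain, if not crash, my Spark environment.
My idea is to do the join first, which generates the 5 billion rows, then spark_apply the function separately on equal-sized chunks, and then append the results back together. To do this locally in R, I would write a for loop, process the chunks one after another, and append the results. How should I do this in Spark?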
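For reference, the local chunk-and-append loop described above might look roughly like this in base R. This is only a minimal sketch on toy data: the helper name `haversine_m`, the number of chunks, and the way rows are assigned to chunks are assumptions made for illustration, and the coordinates are the sample values used in the question's code.

```r
# Hypothetical helper: haversine distance in meters, vectorized over its inputs.
haversine_m <- function(long1, lat1, long2, lat2, radius_m = 6378137) {
  to_rad <- pi / 180
  dlat   <- (lat2 - lat1) * to_rad
  dlong  <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  radius_m * 2 * atan2(sqrt(a), sqrt(1 - a))
}

# Toy inputs: 6 "from" rows and 2 "to" rows.
from <- data.frame(
  place_x = 1:6,
  long1   = rep(c(27.472918, -90.476303, 34.679950), 2),
  lat1    = rep(c(63.293001, 54.239631, -2.855123), 2)
)
to <- data.frame(
  place_y = c("Bolivia", "France"),
  long2   = c(-65.691146, 4.533465),
  lat2    = c(-13.795272, 48.603949),
  stringsAsFactors = FALSE
)

# by = NULL makes merge() return the Cartesian product (6 x 2 = 12 rows).
pairs <- merge(from, to, by = NULL)

# Assign each row to one of 4 roughly equal chunks.
n_chunks <- 4
chunk_id <- ceiling(seq_len(nrow(pairs)) * n_chunks / nrow(pairs))

# Process the chunks one after another, then append the results.
results <- vector("list", n_chunks)
for (k in seq_len(n_chunks)) {
  chunk <- pairs[chunk_id == k, ]
  chunk$distance <- haversine_m(chunk$long1, chunk$lat1,
                                chunk$long2, chunk$lat2)
  results[[k]] <- chunk
}
distances <- do.call(rbind, results)
```

The loop only bounds how much data is processed at once on a single machine; it does not parallelize anything.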
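Here is what I have so far. I would appreciate help with the syntax for running the spark_apply function by the equal_groups column, assuming this is the best strategy for running it efficiently.
If this has anything to do with the following feature, I have not been able to apply it to my syntax: https://www.rstudio.com/blog/sparklyr-1-2/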
library(tidyverse)

# sample data
df <- tibble(
  place = c("Finland", "Canada", "Tanzania", "Bolivia", "France"),
  longitude = c(27.472918, -90.476303, 34.679950, -65.691146, 4.533465),
  latitude = c(63.293001, 54.239631, -2.855123, -13.795272, 48.603949))

from <-
  df[1:3,] %>% # pick first 3 rows
  rename(long1 = longitude,
         lat1 = latitude)

to <- df[4:5,] %>% # pick last 2 rows
  rename(long2 = longitude,
         lat2 = latitude)

# increase data size
n <- 100
from_many <-
  do.call("rbind", replicate(n, from, simplify = FALSE)) %>%
  mutate(place = row_number())

library(sparklyr)
sc <- spark_connect(master = "local")
from_many_sf <- copy_to(sc, from_many, overwrite = TRUE)
to_sf <- copy_to(sc, to, overwrite = TRUE)

# --- haversine distance function ---
get_geodesic_distance <- function(x) {
  geolocation <- function(long1, lat1, long2, lat2) {
    deg2rad <- function(deg) return(deg * pi / 180)
    # Convert degrees to radians
    long1 <- deg2rad(long1)
    lat1 <- deg2rad(lat1)
    long2 <- deg2rad(long2)
    lat2 <- deg2rad(lat2)
    # Earth radius in meters (6378137 m is the equatorial radius; the mean radius is about 6371 km)
    R <- 6378137
    diff.long <- (long2 - long1)
    diff.lat <- (lat2 - lat1)
    a <- (sin(diff.lat/2) * sin(diff.lat/2) + cos(lat1) * cos(lat2) * sin(diff.long/2) * sin(diff.long/2))
    c <- 2 * atan2(sqrt(a), sqrt(1 - a))
    d <- R * c
    return(d) # Distance in meters, because R is given in meters
  }
  dist_vec <- geolocation(x$long1, x$lat1, x$long2, x$lat2)
  res <- dplyr::mutate(x, distance = dist_vec)
  res
}

# expansive (cross) join
full_sf <-
  from_many_sf %>%
  full_join(to_sf, by = character())

full_sf %>% tally()

# application of distance calculation
full_sf %>%
  spark_apply(get_geodesic_distance)

# cut data into groups
full_sf %>%
  sdf_with_sequential_id(., id = "id", from = 1L) %>%
  mutate(equal_groups = ntile(id, 4)) %>%
  group_by(equal_groups) %>%
  tally()
Result of spark_apply:
# Source: spark<?> [?? x 7]
place_x long1 lat1 place_y long2 lat2 distance
<int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 27.5 63.3 Bolivia -65.7 -13.8 11545583.
2 1 27.5 63.3 France 4.53 48.6 2148220.
3 2 -90.5 54.2 Bolivia -65.7 -13.8 7929331.
4 2 -90.5 54.2 France 4.53 48.6 6111619.
5 3 34.7 -2.86 Bolivia -65.7 -13.8 11061340.
6 3 34.7 -2.86 France 4.53 48.6 6427714.
7 4 27.5 63.3 Bolivia -65.7 -13.8 11545583.
8 4 27.5 63.3 France 4.53 48.6 2148220.
9 5 -90.5 54.2 Bolivia -65.7 -13.8 7929331.
10 5 -90.5 54.2 France 4.53 48.6 6111619.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
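At least as far as I can tell from the above, I suggest solving this problem purely in Spark. That means:
- No foreach package: Spark distributes the work to your workers for you.
- No sparklyr::spark_apply():
  "You can run arbitrary R code in each worker node to run any computation ... If you are already familiar with R, you might be tempted to use this approach for all Spark operations; however, this is not the recommended use of spark_apply(). Previous chapters provided more efficient techniques and tools to solve well-known problems." (Mastering Spark with R, Chapter 11)
Doing all of this in Spark is possible because sparklyr translates dplyr functions into Spark SQL, and many Hive UDFs have exactly (or almost exactly) the same names as their R counterparts, for example sin, cos, and so on.
In essence, you can let Spark handle the hard part of distributing the work (inside a for loop that makes smaller chunks of work, if you want one at all) and simply copy and paste the code from the function you defined: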
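library(magrittr) # for %>%, if not already attached

# Split the rows into 4 roughly equal groups based on place_x
full_sf <- full_sf %>%
  dplyr::mutate(group = dplyr::ntile(place_x, n = 4))

# Process one group at a time; the computation itself runs inside Spark
for (grp in 1:4) {
  full_sf %>%
    dplyr::filter(group == grp) %>%
    dplyr::relocate(place_y) %>% # move place_y first so long1:lat2 covers only the coordinate columns
    dplyr::mutate(
      dplyr::across(long1:lat2, ~ .x * pi/180), # degrees to radians
      diff.long = long2 - long1,
      diff.lat = lat2 - lat1,
      a = (sin(diff.lat/2) * sin(diff.lat/2) +
             cos(lat1) * cos(lat2) * sin(diff.long/2) * sin(diff.long/2)),
      c = 2 * atan2(sqrt(a), sqrt(1 - a))) %>%
    dplyr::mutate(d = 6378137 * c) %>% # distance in meters
    print()
}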
# Source: spark<?> [?? x 12]
place_y place_x long1 lat1 long2 lat2 group diff.long diff.lat a c d
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bolivia 1 0.479 1.10 -1.15 -0.241 1 -1.63 -1.35 0.619 1.81 11545583.
2 France 1 0.479 1.10 0.0791 0.848 1 -0.400 -0.256 0.0281 0.337 2148220.
3 Bolivia 2 -1.58 0.947 -1.15 -0.241 1 0.433 -1.19 0.339 1.24 7929331.
4 France 2 -1.58 0.947 0.0791 0.848 1 1.66 -0.0984 0.213 0.958 6111619.
5 Bolivia 3 0.605 -0.0498 -1.15 -0.241 1 -1.75 -0.191 0.581 1.73 11061340.
6 France 3 0.605 -0.0498 0.0791 0.848 1 -0.526 0.898 0.233 1.01 6427714.
7 Bolivia 4 0.479 1.10 -1.15 -0.241 1 -1.63 -1.35 0.619 1.81 11545583.
8 France 4 0.479 1.10 0.0791 0.848 1 -0.400 -0.256 0.0281 0.337 2148220.
9 Bolivia 5 -1.58 0.947 -1.15 -0.241 1 0.433 -1.19 0.339 1.24 7929331.
10 France 5 -1.58 0.947 0.0791 0.848 1 1.66 -0.0984 0.213 0.958 6111619.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
# Source: spark<?> [?? x 12]
place_y place_x long1 lat1 long2 lat2 group diff.long diff.lat a c d
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France 8 -1.58 0.947 0.0791 0.848 2 1.66 -0.0984 0.213 0.958 6111619.
2 Bolivia 9 0.605 -0.0498 -1.15 -0.241 2 -1.75 -0.191 0.581 1.73 11061340.
3 France 9 0.605 -0.0498 0.0791 0.848 2 -0.526 0.898 0.233 1.01 6427714.
4 Bolivia 10 0.479 1.10 -1.15 -0.241 2 -1.63 -1.35 0.619 1.81 11545583.
5 France 10 0.479 1.10 0.0791 0.848 2 -0.400 -0.256 0.0281 0.337 2148220.
6 Bolivia 11 -1.58 0.947 -1.15 -0.241 2 0.433 -1.19 0.339 1.24 7929331.
7 France 11 -1.58 0.947 0.0791 0.848 2 1.66 -0.0984 0.213 0.958 6111619.
8 Bolivia 12 0.605 -0.0498 -1.15 -0.241 2 -1.75 -0.191 0.581 1.73 11061340.
9 France 12 0.605 -0.0498 0.0791 0.848 2 -0.526 0.898 0.233 1.01 6427714.
10 Bolivia 13 0.479 1.10 -1.15 -0.241 2 -1.63 -1.35 0.619 1.81 11545583.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
# Source: spark<?> [?? x 12]
place_y place_x long1 lat1 long2 lat2 group diff.long diff.lat a c d
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bolivia 16 0.479 1.10 -1.15 -0.241 3 -1.63 -1.35 0.619 1.81 11545583.
2 France 16 0.479 1.10 0.0791 0.848 3 -0.400 -0.256 0.0281 0.337 2148220.
3 Bolivia 17 -1.58 0.947 -1.15 -0.241 3 0.433 -1.19 0.339 1.24 7929331.
4 France 17 -1.58 0.947 0.0791 0.848 3 1.66 -0.0984 0.213 0.958 6111619.
5 Bolivia 18 0.605 -0.0498 -1.15 -0.241 3 -1.75 -0.191 0.581 1.73 11061340.
6 France 18 0.605 -0.0498 0.0791 0.848 3 -0.526 0.898 0.233 1.01 6427714.
7 Bolivia 19 0.479 1.10 -1.15 -0.241 3 -1.63 -1.35 0.619 1.81 11545583.
8 France 19 0.479 1.10 0.0791 0.848 3 -0.400 -0.256 0.0281 0.337 2148220.
9 Bolivia 20 -1.58 0.947 -1.15 -0.241 3 0.433 -1.19 0.339 1.24 7929331.
10 France 20 -1.58 0.947 0.0791 0.848 3 1.66 -0.0984 0.213 0.958 6111619.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
# Source: spark<?> [?? x 12]
place_y place_x long1 lat1 long2 lat2 group diff.long diff.lat a c d
<chr> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France 23 -1.58 0.947 0.0791 0.848 4 1.66 -0.0984 0.213 0.958 6111619.
2 Bolivia 24 0.605 -0.0498 -1.15 -0.241 4 -1.75 -0.191 0.581 1.73 11061340.
3 France 24 0.605 -0.0498 0.0791 0.848 4 -0.526 0.898 0.233 1.01 6427714.
4 Bolivia 25 0.479 1.10 -1.15 -0.241 4 -1.63 -1.35 0.619 1.81 11545583.
5 France 25 0.479 1.10 0.0791 0.848 4 -0.400 -0.256 0.0281 0.337 2148220.
6 Bolivia 26 -1.58 0.947 -1.15 -0.241 4 0.433 -1.19 0.339 1.24 7929331.
7 France 26 -1.58 0.947 0.0791 0.848 4 1.66 -0.0984 0.213 0.958 6111619.
8 Bolivia 27 0.605 -0.0498 -1.15 -0.241 4 -1.75 -0.191 0.581 1.73 11061340.
9 France 27 0.605 -0.0498 0.0791 0.848 4 -0.526 0.898 0.233 1.01 6427714.
10 Bolivia 28 0.479 1.10 -1.15 -0.241 4 -1.63 -1.35 0.619 1.81 11545583.
# … with more rows
# ℹ Use `print(n = ...)` to see more rows
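The pipeline above can also be wrapped in a small function so that the same expressions are checked on a local tibble before they are run on a Spark table. The sketch below is only an illustration: `add_haversine`, `radius_m`, and `distance_m` are hypothetical names, not part of dplyr or sparklyr, and the function has only been reasoned through for a local tibble. It uses nothing but arithmetic, `sin`, `cos`, `sqrt` and `atan2`, the same functions the loop above already runs in Spark, so it would be expected to work on a table such as full_sf as well, but that is an assumption to verify on your own cluster.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr or sparklyr): adds a haversine
# distance column, in meters, to a table with long1, lat1, long2, lat2
# given in degrees. Unlike the loop above, it keeps the inputs in degrees.
add_haversine <- function(tbl, radius_m = 6378137) {
  tbl %>%
    mutate(
      dlat  = (lat2 - lat1) * pi / 180,
      dlong = (long2 - long1) * pi / 180,
      a = sin(dlat / 2) * sin(dlat / 2) +
        cos(lat1 * pi / 180) * cos(lat2 * pi / 180) *
        sin(dlong / 2) * sin(dlong / 2),
      # (!!radius_m) inlines the R value into the expression
      distance_m = (!!radius_m) * 2 * atan2(sqrt(a), sqrt(1 - a))
    ) %>%
    select(-dlat, -dlong, -a)
}

# Local check: Finland to Bolivia and Finland to France,
# using the sample coordinates from the question.
pairs <- tibble(
  long1 = 27.472918,
  lat1  = 63.293001,
  long2 = c(-65.691146, 4.533465),
  lat2  = c(-13.795272, 48.603949)
)
local_result <- add_haversine(pairs)
print(local_result)

# On the Spark table (assumption, not run here):
# full_sf %>% add_haversine()
```

Because the whole table goes through one lazy query in this form, the explicit grouping and for loop become optional; Spark already splits the work across partitions.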