Apply function on a cross-join between 2 dataframes Scala Spark
I have two dataframes in Spark (Scala), like this:
Customers:
+--------+-----------+----------------+-----------+---------------+---------------+
|id |postal_code|city_name |valeurPrise|latitudeOK |longitudeOK |
+--------+-----------+----------------+-----------+---------------+---------------+
|22318764|94200 |Ivry-sur-Seine |Number |48.815679000000|2.393150000000 |
|983026 |39330 |Mouchard |Street |46.978240000000|5.807290000000 |
|810029 |33260 |La Teste-de-Buch|Street |44.539033000000|-1.152371000000|
|1880521 |77360 |Vaires-sur-Marne|Street |48.877451000000|2.649342000000 |
|19502247|80090 |Amiens |Number |49.871260000000|2.300264000000 |
|17550309|72100 |Le Mans |Number |47.973960000000|0.206240000000 |
|22311804|94250 |Gentilly |Number |48.816344000000|2.340399000000 |
|284138 |14000 |Caen |Street |49.186034000000|-0.353779000000|
|2011904 |83000 |Toulon |Street |43.125340000000|5.930290000000 |
|21922785|92110 |Clichy |Number |48.910761000000|2.307201000000 |
+--------+-----------+----------------+-----------+---------------+---------------+
Shop:
+------+-----------+----------------+---------------+------+
|erd_cd|ville |gps_wgs84_lat |gps_wgs84_lon |active|
+------+-----------+----------------+---------------+------+
|31312 |MAMOUDZOU |-12.780550000000|45.227770000000|VRAI |
|31901 |ST JOSEPH |-21.376620000000|55.616100000000|VRAI |
|31307 |STE MARIE |-20.899934381104|55.517562110882|VRAI |
|31303 |ST BENOIT |-21.043730000000|55.717850000000|VRAI |
|31302 |ST PIERRE |-21.340676722653|55.477203422331|VRAI |
|35023 |STE SUZANNE|-20.929250000000|55.633290000000|VRAI |
|31305 |ST DENIS |-20.880840000000|55.450700000000|VRAI |
|31304 |LE PORT |-20.956710000000|55.308050000000|VRAI |
|32530 |ST PAUL |-21.008640000000|55.271290000000|VRAI |
|19585 |BEAUNE |47.023000000000 |4.837550000000 |VRAI |
+------+-----------+----------------+---------------+------+
The first contains 19,000,000 rows and the second contains 650 rows.
I want to calculate the distance from each customer to each shop and store the result in a new column of the customers dataframe.
For instance, [23, 47, 125, 8, ...] for the first customer, and so on.
Ideally, I would like to keep the "erd_cd" as well.
So a tuple is perhaps a good solution; for instance, [31312: 23, 27654: 47, ...] would be great.
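Just to illustrate the kind of schema I have in mind (a sketch only, the column name "distances_by_shop" is made up), something like a Map column keyed by erd_cd would do:

import org.apache.spark.sql.types._

// target column: erd_cd -> distance to that shop
val distancesField = StructField("distances_by_shop", MapType(IntegerType, DoubleType))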
I know the formula to compute the distance, so don't worry about that part.
My question is: "How can I simulate a cross-join and apply a function?"
I thought about a cross-join, but it would generate 19,000,000,000 rows (perhaps a little too much).
Do you have any ideas?
Thank you very much.
Shop data can be broadcast as a Map/Set/Seq and used while processing the Customer data. It then becomes a plain map operation, which can run in a highly parallel fashion.
// the shop side is small, so build it on the driver and broadcast it once
val shop = ??? // shop data in Map() or Seq() format, whatever suits your need
val shopB = spark.sparkContext.broadcast(shop)

val customer = ??? // build the Dataset, typed with a case class so map sees objects rather than Rows

customer.map { c =>
  // read the broadcast value inside the map, i.e. on the executors
  val distance = aFunction(c, shopB.value)
  (c.id, c.postal_code, /* ... other columns ..., */ distance)
}.toDF(/* column names */)
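If it helps, here is a more complete, self-contained sketch of that approach. The case classes, the Parquet paths, the haversineKm helper and the output column names are only assumptions made to keep the example runnable; plug in your own sources and your own distance formula.

import org.apache.spark.sql.SparkSession

// hypothetical case classes matching the schemas shown in the question
case class Customer(id: Long, postal_code: String, city_name: String,
                    valeurPrise: String, latitudeOK: Double, longitudeOK: Double)
case class Shop(erd_cd: Int, ville: String, gps_wgs84_lat: Double,
                gps_wgs84_lon: Double, active: String)

object CustomerShopDistances {

  // haversine distance in kilometres (an assumption; replace with your own formula)
  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val r = 6371.0
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
      math.pow(math.sin(dLon / 2), 2)
    2 * r * math.asin(math.sqrt(a))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("customer-shop-distances").getOrCreate()
    import spark.implicits._

    val customers = spark.read.parquet("/path/to/customers").as[Customer] // hypothetical path
    val shops     = spark.read.parquet("/path/to/shops").as[Shop]         // hypothetical path

    // 650 shops easily fit in memory: collect them and broadcast the small side
    val shopB = spark.sparkContext.broadcast(shops.collect())

    // one pass over the 19M customers; each row gets a Map(erd_cd -> distance)
    val result = customers.map { c =>
      val distances: Map[Int, Double] = shopB.value.map { s =>
        s.erd_cd -> haversineKm(c.latitudeOK, c.longitudeOK, s.gps_wgs84_lat, s.gps_wgs84_lon)
      }.toMap
      (c.id, c.postal_code, distances)
    }.toDF("id", "postal_code", "distances_by_shop")

    result.show(false)
  }
}

The distances_by_shop column is a Map keyed by erd_cd, so the shop identifier is kept alongside each distance, and the small side is shipped to each executor only once via the broadcast instead of materialising a 19,000,000,000-row cross-join.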