Spark How to Join Only Within Partitions

Original post: 2020-10-02 19:10:08 | Tags: apache-spark, apache-spark-sql, partitioning

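I have two large DataFrames. Each row has latitude/longitude data. My goal is to join the two DataFrames and find all pairs of points that lie within a given distance of each other, e.g. 100 m.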

df1: (id, lat, lon, geohash7)
df2: (id, lat, lon, geohash7)
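
The geohash7 column in the schemas above is not defined in the question; it presumably holds the 7-character geohash of each row's lat/lon. As a rough illustration only, the standard geohash encoding can be sketched in plain Python as follows. The function name `geohash_encode` is hypothetical, and in a real job a library or a Spark UDF would more likely be used.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744))  # u4pruyd
```

A 7-character geohash cell is roughly 150 m on a side at the equator, so two points less than 100 m apart can still fall into adjacent cells. A join keyed only on geohash7 would not pair such points.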

I want to partition df1 and df2 on geohash7 and then join only within partitions. I want to avoid joining across partitions, to reduce computation.

df1 = df1.repartition(200, "geohash7")
df2 = df2.repartition(200, "geohash7")

df_merged = df1.join(
    df2,
    (df1["geohash7"] == df2["geohash7"])
    & (dist(df1["lat"], df1["lon"], df2["lat"], df2["lon"]) < 100),
)

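In other words, join on geohash7 and then make sure the distance between the points is less than 100 m. The problem is that Spark will actually cross join all the data. How can I make it join only within partitions and not across partitions?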
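
The dist function used in the join condition is not shown in the question. One common choice for great-circle distance is the haversine formula. A minimal plain-Python sketch follows, assuming a spherical Earth with a mean radius of 6,371 km. The name `haversine_m` is hypothetical, and to be applied to DataFrame columns as in the join above it would need to be wrapped as a UDF or rewritten with Spark column expressions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumption of this sketch)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# One degree of latitude is about 111.2 km on this spherical model.
print(round(haversine_m(0.0, 0.0, 1.0, 0.0) / 1000, 1))  # 111.2
```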

1 Answer

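After experimenting extensively with the data, it seems that Spark is smart enough to first apply the equality condition on "geohash7" when joining. If there is no match on that key, it does not evaluate the "dist" function. It also seems that, with an equality condition present, Spark no longer performs a cross join. So I don't need to do anything else; the join above works fine.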
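
The behaviour described in this answer matches how an equi-join generally works: rows are first matched on the equality key, and the remaining predicate is evaluated only for pairs that share that key. The plain-Python sketch below imitates the idea outside Spark; it is not Spark's implementation. The function names, the cell labels and the flat-earth distance approximation are all assumptions made for illustration.

```python
import math
from collections import defaultdict

METERS_PER_DEGREE = 111_320.0  # rough length of one degree of latitude (assumption)

def approx_dist_m(a, b):
    """Flat-earth approximation, adequate only for points a few hundred metres apart."""
    dlat = (a["lat"] - b["lat"]) * METERS_PER_DEGREE
    mean_lat = math.radians((a["lat"] + b["lat"]) / 2)
    dlon = (a["lon"] - b["lon"]) * METERS_PER_DEGREE * math.cos(mean_lat)
    return math.hypot(dlat, dlon)

def join_within_cells(left, right, max_dist_m=100.0):
    """Match rows on the `cell` key first, then apply the distance filter."""
    buckets = defaultdict(list)
    for row in right:
        buckets[row["cell"]].append(row)
    pairs = []
    evaluated = 0  # how many times the distance predicate was computed
    for l_row in left:
        for r_row in buckets.get(l_row["cell"], []):
            evaluated += 1
            if approx_dist_m(l_row, r_row) < max_dist_m:
                pairs.append((l_row["id"], r_row["id"]))
    return pairs, evaluated

# Hypothetical rows; "cell" stands in for the geohash7 column.
left = [
    {"id": "a", "cell": "cell_A", "lat": 57.64911, "lon": 10.40744},
    {"id": "b", "cell": "cell_B", "lat": 57.65000, "lon": 10.41000},
]
right = [
    {"id": "x", "cell": "cell_A", "lat": 57.64915, "lon": 10.40750},
    {"id": "y", "cell": "cell_A", "lat": 57.65011, "lon": 10.40744},
    {"id": "z", "cell": "cell_B", "lat": 57.65001, "lon": 10.41001},
    {"id": "w", "cell": "cell_C", "lat": 57.70000, "lon": 10.50000},
]

pairs, evaluated = join_within_cells(left, right)
print(pairs)      # [('a', 'x'), ('b', 'z')]
print(evaluated)  # 3, compared with 2 * 4 = 8 for a full cross join
```

In Spark itself, calling df_merged.explain() prints the physical plan, which is one way to check whether the join is planned as an equi-join on geohash7 or as a Cartesian product.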

Related questions

How to `reduce` only within partitions in Spark Streaming, perhaps using combineByKey?

How is the number of partitions for an inner join calculated in Spark?

Does Spark Structured Streaming producer using the Kafka default partitioner between Spark partitions or only within partition?

join DataFrames within partitions in PySpark

SPARK: dropDuplicates in every partitions only

How partitions are created in Spark

Join Partitions in Spark SQL for better performance

How are partitions assigned to tasks in Spark

How partitions are created in spark RDD

How spark distributes partitions to executors

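Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.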

© 2020-2024 STACKOOM.COM