Spark DataFrame join with range slow
I have the following input data (in Parquet) for a Spark job:
Person (millions of rows)
+---------+----------+---------------+---------------+
| name | location | start | end |
+---------+----------+---------------+---------------+
| Person1 | 1230 | 1478630000001 | 1478630000010 |
| Person2 | 1230 | 1478630000002 | 1478630000012 |
| Person2 | 1230 | 1478630000013 | 1478630000020 |
| Person3 | 3450 | 1478630000001 | 1478630000015 |
+---------+----------+---------------+---------------+
Event (millions of rows)
+----------+----------+---------------+
| event | location | start_time |
+----------+----------+---------------+
| Biking | 1230 | 1478630000005 |
| Skating | 1230 | 1478630000014 |
| Baseball | 3450 | 1478630000015 |
+----------+----------+---------------+
and I need to transform it into the following expected outcome:
[{
"name" : "Biking",
"persons" : ["Person1", "Person2"]
},
{
"name" : "Skating",
"persons" : ["Person2"]
},
{
"name" : "Baseball",
"persons" : ["Person3"]
}]
In words: the result is a list of events, each with a list of the persons who participated in that event.
A person counts as a participant if
Person.start <= Event.start_time
&& Person.end >= Event.start_time
&& Person.location == Event.location
I have tried different approaches, but the only one that actually seems to work is to join the two DataFrames and then group/aggregate them by event. But the join is extremely slow and does not distribute well across multiple CPU cores.
Current code for the join:
import static org.apache.spark.sql.functions.col;

final DataFrame fullFrame = persons.as("persons")
        .join(events.as("events"), col("persons.location").equalTo(col("events.location"))
                .and(col("events.start_time").geq(col("persons.start")))
                .and(col("events.start_time").leq(col("persons.end"))), "inner");
// count to force an action
fullFrame.count();
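For reference, the group/aggregate step mentioned above could look like the following sketch, reusing the aliased fullFrame. Note that in Spark 1.6 collect_list is backed by a Hive UDAF, so the DataFrames need to come from a HiveContext; the output path is only an example:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_list;

// Group the joined rows by event and collect all participant
// names into one list per event.
final DataFrame result = fullFrame
        .groupBy(col("events.event"))
        .agg(collect_list(col("persons.name")).as("persons"));

// One JSON document per event, matching the expected outcome.
result.toJSON().saveAsTextFile("/tmp/event-participants");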
I am using Spark Standalone and Java, in case that makes a difference.
Does anybody have a better idea of how to solve this problem with Spark 1.6.2?
Range joins are performed as a cross product with a subsequent filter step. A potentially better solution could be to broadcast the potentially smaller events table and then map over the persons table: inside the map, check the join condition and produce the respective result.
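A minimal sketch of that approach in Java, assuming a JavaSparkContext named jsc and that location, start, end and start_time are stored as longs (adjust the Row getters to the actual schema):

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Row;

import scala.Tuple2;

// Collect the small events table on the driver and broadcast it to
// every executor once, instead of shuffling the large persons table.
List<Row> eventRows = events.select("event", "location", "start_time").collectAsList();
final Broadcast<List<Row>> broadcastEvents = jsc.broadcast(eventRows);

// Map over persons: for each person, emit an (event, name) pair for
// every broadcast event whose location matches and whose start_time
// falls into the person's [start, end] interval.
JavaPairRDD<String, Iterable<String>> participants = persons
        .select("name", "location", "start", "end")
        .javaRDD()
        .flatMapToPair(person -> {
            List<Tuple2<String, String>> matches = new ArrayList<>();
            String name = person.getString(0);
            long location = person.getLong(1);
            long start = person.getLong(2);
            long end = person.getLong(3);
            for (Row event : broadcastEvents.value()) {
                long startTime = event.getLong(2);
                if (location == event.getLong(1) && startTime >= start && startTime <= end) {
                    matches.add(new Tuple2<>(event.getString(0), name));
                }
            }
            return matches;
        })
        .groupByKey();

This replaces the shuffle-heavy range join with a map-side lookup; the only remaining shuffle is the final groupByKey over the (event, name) pairs.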