Spark: Enhance Join between Terabytes of Datasets
I have five Hive tables; assume their names are A, B, C, D, and E. Each table has a customer_id column that serves as the key for joining them. Also, each table contains somewhere between 100 and 600 columns, all stored in Parquet format.
Example of one table below:
CREATE TABLE table_a
(
customer_id BIGINT,
col_1 STRING,
col_2 STRING,
col_3 STRING,
...
col_600 STRING
)
STORED AS PARQUET;
I need to join all of these tables together on customer_id in the most optimal way. I tried sortByKey before the join, but there is still a performance bottleneck. I tried to repartition by key before the join, but the performance is still not good. I tried to increase Spark's parallelism to 6000 with many executors, but was not able to achieve good results. A sample of the join I tried is below (a sketch of the tuning attempts follows it):
val dsA = spark.table("table_a")
val dsB = spark.table("table_b")
val dsC = spark.table("table_c")
val dsD = spark.table("table_d")
val dsE = spark.table("table_e")

val dsAJoineddsB = dsA.join(dsB, Seq("customer_id"), "inner")
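For context, here is a minimal sketch of the two tuning attempts described above: repartitioning by the join key and raising the shuffle parallelism. The 6000 figure comes from the question; the exact API calls are my assumption about how it was applied.

import org.apache.spark.sql.functions.col

// Raise the shuffle parallelism (the question mentions trying 6000).
spark.conf.set("spark.sql.shuffle.partitions", "6000")

// Repartition both sides by the join key up front
// (what the question describes as "repartition by key before join").
val dsARep = dsA.repartition(col("customer_id"))
val dsBRep = dsB.repartition(col("customer_id"))
val preShuffledJoin = dsARep.join(dsBRep, Seq("customer_id"), "inner")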
I think in this case the direct join is not the optimal choice. You can achieve this task in the following simple way.
Define a case class FeatureData with two fields:

case class FeatureData(customer_id: Long, featureValue: Map[String, String])

Map every dataset to this case class, union them all, and then groupByKey on the shared customer_id. Done this way, the union will be faster than a join, but it needs more work.
After that, you will have a dataset of (key, map) records. You can then apply your transformation on each (key, Map(feature_name -> feature_value)) pair; a sketch of such a transformation appears after the implementation below.
A simple example of the implementation follows. First map each dataset to the case class, then union all of them. After that, groupByKey, then map and reduce each group.
import spark.implicits._

case class FeatureMappedData(customer_id: Long, feature: Map[String, String])

// Map each source table into the common (customer_id, feature map) shape.
val dsAMapped = dsA.map(row =>
  FeatureMappedData(row.getAs[Long]("customer_id"),
    Map("featureA" -> row.getAs[String]("featureA"),
        "featureB" -> row.getAs[String]("featureB"))))
// dsBMapped (and the other tables) are built the same way.
val unionDataSet = dsAMapped union dsBMapped

val mergedDs = unionDataSet
  .groupByKey(_.customer_id)
  .mapGroups { case (id, featureIter) =>
    // Merge the per-table maps for this customer; missing features default to "0".
    val featuresMapped = featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
    FeatureMappedData(id, featuresMapped)
  }
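Once merged, you can apply transformations per feature name as described above. A hypothetical sketch follows; the feature names and the resulting column layout are illustrative, not part of the original answer.

// Project selected features from the merged map back out as columns.
// getOrElse is used because a Map's withDefaultValue default may not
// survive Dataset encoding after a shuffle.
val wide = mergedDs.map { fd =>
  (fd.customer_id,
   fd.feature.getOrElse("featureA", "0"),
   fd.feature.getOrElse("featureB", "0"))
}.toDF("customer_id", "featureA", "featureB")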