I have five Hive tables; assume their names are A, B, C, D, and E. Each table has a customer_id column that is used as the join key between them. Each table also contains between 100 and 600 columns, and all of the tables are in Parquet format.
An example of one table is below:
CREATE TABLE table_a
(
customer_id BIGINT,
col_1 STRING,
col_2 STRING,
col_3 STRING,
.
.
col_600 STRING
)
STORED AS PARQUET;
I need to achieve two points. I tried sortByKey before the join, but there is still a performance bottleneck. I also tried to repartition by key before the join, but the performance is still not good. I increased the Spark parallelism to 6000 with many executors, but was not able to achieve good results. A sample of the join I tried is below:
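// Load each Hive table as a DataFrame (table names are passed as strings)
val dsA = spark.table("table_a")
val dsB = spark.table("table_b")
val dsC = spark.table("table_c")
val dsD = spark.table("table_d")
val dsE = spark.table("table_e")
// Inner join on the shared key column
val dsAJoineddsB = dsA.join(dsB, Seq("customer_id"), "inner")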
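For reference, the repartition-by-key attempt described above could be sketched roughly as follows, extended to any number of tables. This is only an illustration under assumptions: `JoinAttemptSketch`, `joinAll`, and `numPartitions` are hypothetical names, and the only value taken from the text is the parallelism of 6000.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object JoinAttemptSketch {
  // Repartition every table by the join key, then chain inner joins left to right.
  // numPartitions is an assumed parameter; the text above mentions a value of 6000.
  def joinAll(tables: Seq[DataFrame], numPartitions: Int): DataFrame =
    tables
      .map(_.repartition(numPartitions, col("customer_id")))
      .reduce((left, right) => left.join(right, Seq("customer_id"), "inner"))
}
```

With the DataFrames from the sample, this would be called as `JoinAttemptSketch.joinAll(Seq(dsA, dsB, dsC, dsD, dsE), 6000)`. Each join still shuffles very wide rows, which is consistent with the bottleneck described.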
I think a direct join is not the optimal approach in this case. You can achieve this task in the simple way below.
First, create a case class, for example FeatureData, with two fields: case class FeatureData(customer_id: Long, featureValue: Map[String, String]). Then map each table to this case class, union all of the datasets, and groupByKey so that records with the same key end up together. Done this way, the union should be faster than the join, but it needs more work.
After that, you will have a dataset of (key, map) pairs, and you can apply your transformations per key on Map(feature_name).
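To make the idea concrete without a cluster, here is a minimal sketch of the same union, group, and merge steps on plain Scala collections. It is not Spark code; `FeatureMergeSketch`, `mergeByKey`, and the sample feature names are hypothetical and only illustrate the shape of the data.

```scala
object FeatureMergeSketch {
  // Same shape as the FeatureData case class described above.
  final case class FeatureData(customer_id: Long, featureValue: Map[String, String])

  // Union the per-table records, group them by key, and merge each group's maps.
  def mergeByKey(tables: Seq[Seq[FeatureData]]): Map[Long, Map[String, String]] =
    tables.flatten
      .groupBy(_.customer_id)
      .map { case (id, rows) => id -> rows.map(_.featureValue).reduce(_ ++ _) }

  def main(args: Array[String]): Unit = {
    // Two tiny "tables" with made-up feature names
    val tableA = Seq(FeatureData(1L, Map("a_col_1" -> "x")), FeatureData(2L, Map("a_col_1" -> "y")))
    val tableB = Seq(FeatureData(1L, Map("b_col_1" -> "z")))
    mergeByKey(Seq(tableA, tableB)).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Two properties of this approach are worth noting. A key that appears in only some tables is still kept, so the result behaves like a full outer join rather than an inner join. Also, `++` lets the right-hand map win on duplicate keys, so feature names need to be unique across tables.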
A simple example of the implementation follows. First map each dataset to the case class, then union all of them. After that, groupByKey, then map each group and reduce it.
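import spark.implicits._  // provides the encoders for the case class

case class FeatureMappedData(customer_id: Long, feature: Map[String, String])

// dsA is a DataFrame, so each field is read from the Row by name
val dsAMapped = dsA.map(row =>
  FeatureMappedData(row.getAs[Long]("customer_id"),
    Map("featureA" -> row.getAs[String]("featureA"),
        "featureB" -> row.getAs[String]("featureB"))))
// dsBMapped is built from dsB in the same way (not shown)
val unionDataSet = dsAMapped union dsBMapped
// Merge the feature maps of all records that share a customer_id
unionDataSet.groupByKey(_.customer_id)
  .mapGroups {
    case (eid, featureIter) =>
      val featuresMapped: Map[String, String] = featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
      FeatureMappedData(eid, featuresMapped)
  }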
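With hundreds of columns per table, writing each map entry by hand is impractical, so the mapping step can be generated from the column list. The sketch below is one possible way to do that, not a definitive implementation. It assumes every non-key column is a STRING, as in the table definition above, and that `customer_id` is never null. `FeatureUnionSketch`, `toFeatures`, `mergeAll`, and the `prefix` parameter are hypothetical names; the prefix is there only to keep column names such as col_1 from colliding across tables.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Mirrors the case class used in the example above.
case class FeatureMappedData(customer_id: Long, feature: Map[String, String])

object FeatureUnionSketch {
  // Turn every non-key column of one table into a (prefixed name -> value) map entry.
  def toFeatures(df: DataFrame, prefix: String)(implicit spark: SparkSession): Dataset[FeatureMappedData] = {
    import spark.implicits._
    val featureCols = df.columns.filter(_ != "customer_id")
    df.map { row =>
      val features = featureCols.iterator
        .filter(c => !row.isNullAt(row.fieldIndex(c))) // leave null values out of the map
        .map(c => s"$prefix.$c" -> row.getAs[String](c))
        .toMap
      FeatureMappedData(row.getAs[Long]("customer_id"), features)
    }
  }

  // Union all mapped tables, then merge the maps of records sharing a customer_id.
  def mergeAll(tables: Seq[Dataset[FeatureMappedData]])(implicit spark: SparkSession): Dataset[FeatureMappedData] = {
    import spark.implicits._
    tables.reduce(_ union _)
      .groupByKey(_.customer_id)
      .mapGroups { (id, rows) => FeatureMappedData(id, rows.map(_.feature).reduce(_ ++ _)) }
  }
}
```

With an implicit SparkSession in scope, usage could look like `FeatureUnionSketch.mergeAll(Seq(FeatureUnionSketch.toFeatures(dsA, "a"), FeatureUnionSketch.toFeatures(dsB, "b")))`. Note that mapGroups still shuffles all records by key, so whether this beats the chained join for a given data set should be measured rather than assumed.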