
Spark: Enhancing a Join between Terabytes of Datasets

I have five Hive tables; assume their names are A, B, C, D, and E. Each table has a customer_id column that serves as the join key between them. Each table contains between 100 and 600 columns, and all of them are stored in Parquet format.

An example of one of the tables:

CREATE TABLE table_a
(
  customer_id BIGINT,
  col_1 STRING,
  col_2 STRING,
  col_3 STRING,
  ...
  col_600 STRING
)
STORED AS PARQUET;

I need to achieve two things:

  • Join all of them together in the most optimal way using Spark Scala. I tried sortByKey before the join, but there is still a performance bottleneck. I tried to repartition by key before the join (a sketch of that attempt is shown after the sample join below), but the performance was still not good. I tried increasing Spark's parallelism to 6000 with many executors, but was not able to achieve good results.
  • After the join, I need to apply a separate function to some of these columns.

A sample of the join I tried is below:

val dsA = spark.table("table_a")
val dsB = spark.table("table_b")
val dsC = spark.table("table_c")
val dsD = spark.table("table_d")
val dsE = spark.table("table_e")

val dsAJoinedDsB = dsA.join(dsB, Seq("customer_id"), "inner")
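For reference, here is a minimal sketch of the repartition-by-key attempt mentioned above. The partition count of 6000 comes from the question; chaining the remaining tables the same way is assumed:

import org.apache.spark.sql.functions.col

// Pre-shuffle both sides on the join key with the same partition count,
// so the subsequent join can reuse this partitioning instead of
// shuffling the wide 600-column rows a second time.
val dsARep = dsA.repartition(6000, col("customer_id"))
val dsBRep = dsB.repartition(6000, col("customer_id"))

val joined = dsARep.join(dsBRep, Seq("customer_id"), "inner")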

I think in this case the direct join is not optimal. You can achieve this task in the following simple way:

  • First, create a case class, for example FeatureMappedData, with two fields: case class FeatureMappedData(customer_id: Long, feature: Map[String, String]).
  • Second, map each table to the FeatureMappedData case class: (customer_id, Map(feature_name -> feature_value)).
  • Third, union all the datasets, then groupByKey to merge the rows that share the same key.

With the above approach, the union will be faster than the join, but it needs more work.

After that, you will have a dataset of (key, map) pairs, and you can apply your transformation to the map values by feature name.

A simple example of the implementation follows: first map each dataset to the case class, then union all of them; after that, groupByKey, map the groups, and reduce.

import spark.implicits._

case class FeatureMappedData(customer_id: Long, feature: Map[String, String])

// Map each table's rows into the case class (repeat for dsB..dsE).
val dsAMapped = dsA.map(row =>
  FeatureMappedData(row.getAs[Long]("customer_id"),
    Map("featureA" -> row.getAs[String]("featureA"),
        "featureB" -> row.getAs[String]("featureB"))))

val unionDataSet = dsAMapped union dsBMapped

val mergedDataSet = unionDataSet.groupByKey(_.customer_id)
  .mapGroups { case (eid, featureIter) =>
    // Merge the per-table feature maps into a single map per customer.
    val featuresMapped: Map[String, String] =
      featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
    FeatureMappedData(eid, featuresMapped)
  }
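To cover the second requirement (applying a separate function to some of the columns after the join), here is a minimal sketch built on the merged dataset above. The transformFeature function and the feature names are hypothetical placeholders:

// Hypothetical transformation; replace with the real per-column logic.
def transformFeature(value: String): String = value.trim.toUpperCase

val transformed = mergedDataSet.map { fd =>
  // Apply the function only to the selected feature names;
  // features absent from the map are left untouched.
  val selected = Seq("featureA", "featureB")
  val updated = fd.feature ++ selected.flatMap(name =>
    fd.feature.get(name).map(v => name -> transformFeature(v)))
  FeatureMappedData(fd.customer_id, updated)
}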
