
Spark Enhance Join between Terabytes of Datasets

I have five Hive tables; assume the names are A, B, C, D, and E. Each table has a customer_id column that is the key for the join between them. Also, each table contains between 100 and 600 columns, and all of them are stored in Parquet format.

Example of one table below:

CREATE TABLE table_a
(
  customer_id BIGINT,   -- join key shared by all five tables
  col_1 STRING,
  col_2 STRING,
  col_3 STRING,
  .
  .
  col_600 STRING
)
STORED AS PARQUET;

I need to achieve two things:

  • Join all of them together in the most optimal way using Spark Scala. I tried to sortByKey before the join, but there is still a performance bottleneck. I tried to repartition by key before the join, but the performance is still not good. I tried to increase Spark's parallelism to 6000 with many executors, but was not able to achieve good results (roughly what these attempts looked like is sketched after this list).
  • After the join, I need to apply a separate function to some of these columns.
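For reference, this is roughly what the repartition-by-key and parallelism attempts looked like. It is a minimal sketch, not the exact code: the table and column names come from the example above, and the value 6000 mirrors the parallelism mentioned in the first point.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("WideTableJoin")
  // High shuffle parallelism, as described above (illustrative value).
  .config("spark.sql.shuffle.partitions", "6000")
  .getOrCreate()

// Repartition each table by the join key before joining, so rows with the
// same customer_id land in the same partition and the join shuffle is aligned.
val dsA = spark.table("table_a").repartition(6000, col("customer_id"))
val dsB = spark.table("table_b").repartition(6000, col("customer_id"))
// The tables were then joined as in the sample shown below.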

A sample of the join I tried is below:

val dsA = spark.table("table_a")
val dsB = spark.table("table_b")
val dsC = spark.table("table_c")
val dsD = spark.table("table_d")
val dsE = spark.table("table_e")

val dsAJoineddsB = dsA.join(dsB, Seq("customer_id"), "inner")
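For completeness, the direct approach would chain the remaining tables in the same way. This is only a sketch of what the full chained join looks like; the intermediate name joinedAll is illustrative:

// Each join shuffles very wide Parquet rows, which is where the bottleneck appears.
val joinedAll = dsA
  .join(dsB, Seq("customer_id"), "inner")
  .join(dsC, Seq("customer_id"), "inner")
  .join(dsD, Seq("customer_id"), "inner")
  .join(dsE, Seq("customer_id"), "inner")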

I think that in this case the direct join is not the optimal approach. You can achieve this task using the simple way below.

  • First, create a case class, for example FeatureData, with two fields: case class FeatureData(customer_id: Long, featureValue: Map[String, String]).
  • Second, map each table to the FeatureData case class: key, [feature_name, feature_value].
  • Third, union all the mapped datasets and groupByKey so that rows with the same key are combined.

In the above way, the union will be faster than the join, but it needs more work.

After that, you will have a dataset of (key, map). You will then apply the transformation to key, Map(feature_name).

A simple example of the implementation is as follows: first map each dataset to the case class, then union all of them. After that, groupByKey, then map and reduce.

import spark.implicits._   // encoders for the case class used below

case class FeatureMappedData(customer_id: Long, feature: Map[String, String])

// Map each wide table to (customer_id, Map(feature_name -> feature_value)).
// dsBMapped, dsCMapped, ... are built from the other tables in the same way.
val dsAMapped = dsA.map(row =>
  FeatureMappedData(row.getAs[Long]("customer_id"),
    Map("featureA" -> row.getAs[String]("featureA"),
        "featureB" -> row.getAs[String]("featureB"))))

val unionDataSet = dsAMapped union dsBMapped

val groupedDataSet = unionDataSet.groupByKey(_.customer_id)
  .mapGroups { case (eid, featureIter) =>
    // Merge the feature maps coming from the different tables for this customer.
    val featuresMapped: Map[String, String] =
      featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
    FeatureMappedData(eid, featuresMapped)
  }
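For the second requirement (applying a separate function to some of the columns after the join), the grouped dataset above can be mapped once more. This is a minimal sketch: transformFeature is a hypothetical placeholder for whatever function you need, and featureA is just the illustrative feature name used above.

// Hypothetical per-feature function; replace the body with the real logic.
def transformFeature(name: String, value: String): String =
  if (name == "featureA") value.toUpperCase else value

val transformedDataSet = groupedDataSet.map { fd =>
  // Apply the function to each (feature_name, feature_value) pair,
  // keeping features that do not need a transformation unchanged.
  val updated = fd.feature.map { case (name, value) => name -> transformFeature(name, value) }
  FeatureMappedData(fd.customer_id, updated)
}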
