
Spark Enhance Join between Terabytes of Datasets

I have five Hive tables; assume the names are A, B, C, D, and E. Each table has a customer_id column that is the key for the join between them. Also, each table contains between 100 and 600 columns, and all of them are stored in Parquet format.

Example of one table below:

CREATE TABLE table_a
(
  customer_id BIGINT,   -- join key shared by all five tables
  col_1 STRING,
  col_2 STRING,
  col_3 STRING,
  .
  .
  col_600 STRING
)
STORED AS PARQUET;

I need to achieve two things:

  • Join all of them together in the most optimal way using Spark Scala. I tried to sortByKey before the join, but there is still a performance bottleneck. I tried to repartition by key before the join, but the performance is still not good. I tried to increase Spark's parallelism to 6000 with many executors, but was not able to achieve good results (roughly what these attempts looked like is sketched after this list).
  • After the join, I need to apply a separate function to some of these columns.
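For reference, this is roughly what the repartition-by-key and parallelism attempts looked like. It is a minimal sketch, not the exact code: the table and column names come from the example above, and the value 6000 mirrors the parallelism mentioned in the first point.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("WideTableJoin")
  // High shuffle parallelism, as described above (illustrative value).
  .config("spark.sql.shuffle.partitions", "6000")
  .getOrCreate()

// Repartition each table by the join key before joining, so rows with the
// same customer_id land in the same partition and the join shuffle is aligned.
val dsA = spark.table("table_a").repartition(6000, col("customer_id"))
val dsB = spark.table("table_b").repartition(6000, col("customer_id"))
// The tables were then joined as in the sample shown below.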

A sample of the join I tried is below:

val dsA = spark.table("table_a")
val dsB = spark.table("table_b")
val dsC = spark.table("table_c")
val dsD = spark.table("table_d")
val dsE = spark.table("table_e")

val dsAJoineddsB = dsA.join(dsB, Seq("customer_id"), "inner")
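For completeness, the direct approach would chain the remaining tables in the same way. This is only a sketch of what the full chained join looks like; the intermediate name joinedAll is illustrative:

// Each join shuffles very wide Parquet rows, which is where the bottleneck appears.
val joinedAll = dsA
  .join(dsB, Seq("customer_id"), "inner")
  .join(dsC, Seq("customer_id"), "inner")
  .join(dsD, Seq("customer_id"), "inner")
  .join(dsE, Seq("customer_id"), "inner")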

I think that in this case the direct join is not the optimal approach. You can achieve this task using the simple way below.

  • First, create a case class, for example FeatureData, with two fields: case class FeatureData(customer_id: Long, featureValue: Map[String, String]).
  • Second, map each table to the FeatureData case class: key, [feature_name, feature_value].
  • Third, union all the mapped datasets and groupByKey so that rows with the same key are combined.

In the above way, the union will be faster than the join, but it needs more work.

After that, you will have a dataset of (key, map). You will then apply the transformation to key, Map(feature_name).

A simple example of the implementation is as follows: first map each dataset to the case class, then union all of them. After that, groupByKey, then map and reduce.

import spark.implicits._   // encoders for the case class used below

case class FeatureMappedData(customer_id: Long, feature: Map[String, String])

// Map each wide table to (customer_id, Map(feature_name -> feature_value)).
// dsBMapped, dsCMapped, ... are built from the other tables in the same way.
val dsAMapped = dsA.map(row =>
  FeatureMappedData(row.getAs[Long]("customer_id"),
    Map("featureA" -> row.getAs[String]("featureA"),
        "featureB" -> row.getAs[String]("featureB"))))

val unionDataSet = dsAMapped union dsBMapped

val groupedDataSet = unionDataSet.groupByKey(_.customer_id)
  .mapGroups { case (eid, featureIter) =>
    // Merge the feature maps coming from the different tables for this customer.
    val featuresMapped: Map[String, String] =
      featureIter.map(_.feature).reduce(_ ++ _).withDefaultValue("0")
    FeatureMappedData(eid, featuresMapped)
  }
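For the second requirement (applying a separate function to some of the columns after the join), the grouped dataset above can be mapped once more. This is a minimal sketch: transformFeature is a hypothetical placeholder for whatever function you need, and featureA is just the illustrative feature name used above.

// Hypothetical per-feature function; replace the body with the real logic.
def transformFeature(name: String, value: String): String =
  if (name == "featureA") value.toUpperCase else value

val transformedDataSet = groupedDataSet.map { fd =>
  // Apply the function to each (feature_name, feature_value) pair,
  // keeping features that do not need a transformation unchanged.
  val updated = fd.feature.map { case (name, value) => name -> transformFeature(name, value) }
  FeatureMappedData(fd.customer_id, updated)
}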
