
Apache Spark SQL: how to optimize chained join for dataframe

I have to make a left join between a principal data frame and several reference data frames, i.e. a chain of join computations, and I wonder how to make this operation efficient and scalable.

Method 1 is easy to understand, and it is also the current method, but I'm not satisfied with it: all the transformations are chained together and wait for the final action to trigger the computation. If I keep adding transformations and the volume of data grows, Spark eventually fails, so this method is not scalable.

Method 1:

  def pipeline(refDF1: DataFrame, refDF2: DataFrame, refDF3: DataFrame, refDF4: DataFrame, refDF5: DataFrame): DataFrame => DataFrame = {

    val transformations: List[DataFrame => DataFrame] = List(
      castColumnsFromStringToLong(ColumnsToCastToLong),
      castColumnsFromStringToFloat(ColumnsToCastToFloat),
      renameColumns(RenameMapping),
      filterAndDropColumns,
      joinRefDF1(refDF1),
      joinRefDF2(refDF2),
      joinRefDF3(refDF3),
      joinRefDF4(refDF4),
      joinRefDF5(refDF5),
      calculate()
    )

    // compose everything into a single DataFrame => DataFrame function
    transformations.reduce(_ andThen _)
  }

  pipeline(refDF1, refDF2, refDF3, refDF4, refDF5)(principleDF)
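
For reference, this composition only works if each helper is (or can be partially applied into) a DataFrame => DataFrame function. Below is a minimal sketch of what two such helpers could look like; the signatures and the join key "id" are assumptions, since the question does not show the real implementations.

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Assumed helper: returns a transformation that casts the given columns to Long.
  def castColumnsFromStringToLong(columns: Seq[String]): DataFrame => DataFrame =
    df => columns.foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast("long")))

  // Assumed helper: returns a transformation that left-joins against refDF1 on "id".
  def joinRefDF1(refDF1: DataFrame): DataFrame => DataFrame =
    df => df.join(refDF1, Seq("id"), "left")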

Method 2: I haven't found a real way to implement this idea yet, but I would like to trigger the computation of each join immediately.

According to my tests, count() is too heavy for Spark and useless for my application, but I don't know how to trigger the join computation with a more efficient action. That action is, in fact, what this question is asking for.

  val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
  joinedDF_1.cache() // not every intermediate DataFrame is reused, but some are, so cache() marks that usage
  joinedDF_1.count()  

  val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
  joinedDF_2.cache()
  joinedDF_2.count()

  val joinedDF_3 = renameColumns(joinedDF_2, RenameMapping)
  joinedDF_3.cache()
  joinedDF_3.count()

  val joinedDF_4 = filterAndDropColumns(joinedDF_3)
  joinedDF_4.cache()
  joinedDF_4.count()

  ...

When you want to force the computation of a given join (or any non-final transformation) in Spark, you can use a simple show or count on your DataFrame. Such terminal operations force the computation of the result, because otherwise the action simply cannot be executed.

Only after that will your DataFrame actually be stored in the cache.

Once you're finished with a given DataFrame, don't hesitate to call unpersist; this releases the cached data so your cluster has more room for further computation.
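
A minimal sketch of this cache / action / unpersist pattern applied to a chain of joins; the join key "id" and the left join type are assumptions, since the question does not show them:

  // Each stage: cache, materialize with an action, then release the previous stage.
  val stage1 = principleDF.join(refDF1, Seq("id"), "left")   // "id" is an assumed join key
  stage1.cache()
  stage1.count()      // forces the join to run and populates the cache

  val stage2 = stage1.join(refDF2, Seq("id"), "left")
  stage2.cache()
  stage2.count()

  stage1.unpersist()  // free stage1 once stage2 has been materialized

Note that count() touches every partition, so it fills the cache completely; show() only computes as many partitions as it needs for the rows it displays, so it may cache the DataFrame only partially.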

You need to repartition your datasets by the join columns before calling the join transformation.

Example:

  val df1Partitioned = df1.repartition(col("col1"), col("col2"))
  val df2Partitioned = df2.repartition(col("col1"), col("col2"))
  val joinDF = df1Partitioned.join(df2Partitioned,
    df1Partitioned("col1") === df2Partitioned("col1") && ...)
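
A self-contained sketch of the same idea, with illustrative data and an assumed pair of join keys col1 and col2:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("repartition-before-join").getOrCreate()
  import spark.implicits._

  // Illustrative data; in practice these are the principal and reference DataFrames.
  val left  = Seq((1L, "a", 10.0), (2L, "b", 20.0)).toDF("col1", "col2", "value")
  val right = Seq((1L, "a", "ref1"), (3L, "c", "ref3")).toDF("col1", "col2", "refValue")

  // Repartition both sides on the join keys so rows with the same keys end up
  // in correspondingly numbered partitions before the join.
  val leftRep  = left.repartition(col("col1"), col("col2"))
  val rightRep = right.repartition(col("col1"), col("col2"))

  val joined = leftRep.join(rightRep, Seq("col1", "col2"), "left")
  joined.show()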

Try creating a new DataFrame based on it. For example:

  // Rebuilding the DataFrame from its RDD and schema starts a new, short lineage
  val dfTest = session.createDataFrame(df.rdd, df.schema).cache()
  dfTest.storageLevel.useMemory // result should be true
