
Apache Spark SQL: how to optimize chained join for dataframe

I have to left join a principal data frame with several reference data frames, so the computation is a chain of joins. I wonder how to make this efficient and scalable.

Method 1 (the current implementation) is easy to understand, but I'm not satisfied with it: all the transformations are chained lazily and wait for the final action to trigger the computation. If I keep adding transformations and the data volume grows, Spark will eventually fail at the end, so this method does not scale.

Method 1:

  def pipeline(refDF1: DataFrame, refDF2: DataFrame, refDF3: DataFrame,
               refDF4: DataFrame, refDF5: DataFrame): DataFrame => DataFrame = {

    val transformations: List[DataFrame => DataFrame] = List(
      castColumnsFromStringToLong(ColumnsToCastToLong),
      castColumnsFromStringToFloat(ColumnsToCastToFloat),
      renameColumns(RenameMapping),
      filterAndDropColumns,
      joinRefDF1(refDF1),
      joinRefDF2(refDF2),
      joinRefDF3(refDF3),
      joinRefDF4(refDF4),
      joinRefDF5(refDF5),
      calculate()
    )

    transformations.reduce(_ andThen _)
  }

  pipeline(refDF1, refDF2, refDF3, refDF4, refDF5)(principleDF)
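
(Each element of transformations has the type DataFrame => DataFrame, so the helpers are curried. A simplified sketch of two of them; the join key and the bodies here are only placeholders, not my actual code:)

  // Simplified sketches -- the real join keys and logic are omitted.
  def joinRefDF1(refDF1: DataFrame)(df: DataFrame): DataFrame =
    df.join(refDF1, Seq("ref1_id"), "left") // "ref1_id" is a placeholder key

  def renameColumns(mapping: Map[String, String])(df: DataFrame): DataFrame =
    mapping.foldLeft(df) { case (acc, (from, to)) => acc.withColumnRenamed(from, to) }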

Method 2: I haven't found a real way to implement this idea yet, but I would like to trigger the computation of each join immediately.

According to my tests, count() is too heavy for Spark and useless for my application, and I don't know how to trigger the join computation with a more efficient action. Such an action is, in fact, what I'm asking for in this question.

  val joinedDF_1 = castColumnsFromStringToLong(principleDF, ColumnsToCastToLong)
  joinedDF_1.cache() // joinedDF_1 is not always reused, but some of these data frames are, so I call cache() to mark that usage
  joinedDF_1.count()  

  val joinedDF_2 = castColumnsFromStringToFloat(joinedDF_1, ColumnsToCastToFloat)
  joinedDF_2.cache()
  joinedDF_2.count()

  val joinedDF_3 = renameColumns(joinedDF_2, RenameMapping)
  joinedDF_3.cache()
  joinedDF_3.count()

  val joinedDF_4 = filterAndDropColumns(joinedDF_3)
  joinedDF_4.cache()
  joinedDF_4.count()

  ...

When you want to force the computation of a given join (or of any transformation that is not final) in Spark, you can call a simple show or count on your DataFrame. Such terminal operations force the computation of the result, because otherwise the action simply cannot execute.

Only after this will your DataFrame be effectively stored in your cache.

Once you're finished with a given DataFrame, don't hesitate to call unpersist. This frees the cached data and gives your cluster more room for further computations.
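
A minimal sketch of that cache → action → unpersist cycle, reusing the names from your question (the downstream steps are only indicative):

  val joinedDF = joinRefDF1(refDF1)(principleDF)

  joinedDF.cache()   // mark the DataFrame for caching (lazy)
  joinedDF.count()   // action: forces the join and materializes the cache

  // ... run the transformations that reuse joinedDF ...

  joinedDF.unpersist() // free the cached blocks once joinedDF is no longer needed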

You need to repartition your datasets by the join columns before calling the join transformation.

Example:

  val df1Part = df1.repartition(col("col1"), col("col2"))
  val df2Part = df2.repartition(col("col1"), col("col2"))
  val joinDF  = df1Part.join(df2Part, df1Part.col("col1") === df2Part.col("col1") && ...)
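
A more complete sketch of the same idea (the second key column and the left join type are assumptions based on the question):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // Repartition both sides on the join keys so matching rows end up in the
  // same partitions, then join. "col1"/"col2" and "left" are placeholders.
  def repartitionAndJoin(df1: DataFrame, df2: DataFrame): DataFrame = {
    val left  = df1.repartition(col("col1"), col("col2"))
    val right = df2.repartition(col("col1"), col("col2"))
    left.join(
      right,
      left.col("col1") === right.col("col1") && left.col("col2") === right.col("col2"),
      "left"
    )
  }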

Try creating a new DataFrame based on it, for example:

  val dfTest = session.createDataFrame(df.rdd, df.schema).cache()
  dfTest.storageLevel.useMemory // should return true
