
How to apply a customized function with multiple parameters to each group of a dataframe and union the resulting dataframes in Scala Spark?

I have a customized function, shown below, that takes a dataframe plus two additional parameters and returns a different dataframe as its output:

def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame = {...}

and I want to apply this function to each group of

df.groupBy("type")

then append the output dataframes from each type into one dataframe.

This is a little different from other questions about applying customized functions to grouped dataframes, in that this function also takes other inputs in addition to the dataframe in question, df.groupBy("type").

What's the best way to do this?

You can filter down the original df to the different groups, call customizedfun for each group and then union the results.

I assume that customizedfun is a function that simply adds the two parameters as a new column, but it could be any function:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame =
  data.withColumn("newCol", lit(s"$param2 $param1"))

I need two helper functions that calculate the values of param1 and param2 depending on the value of type. In a real-world application, these functions could be, for example, a lookup into a dictionary.

def calcParam1(typ: Integer): Boolean = typ % 2 == 0
def calcParam2(typ: Integer): String = s"type is $typ"
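As a rough sketch of the dictionary-lookup variant mentioned above, the two helpers could read from plain Scala maps. The names lookupParam1, lookupParam2, param1ByType and param2ByType, the map contents, and the fallback values are all hypothetical and only serve as an illustration. The sketch uses Int keys; a java.lang.Integer argument is converted implicitly.

```scala
// Hypothetical lookup tables; keys and values are made up for illustration.
val param1ByType: Map[Int, Boolean] = Map(1 -> false, 2 -> true, 3 -> false)
val param2ByType: Map[Int, String] =
  Map(1 -> "type is 1", 2 -> "type is 2", 3 -> "type is 3")

// Fall back to a default when a type has no entry instead of throwing.
def lookupParam1(typ: Int): Boolean = param1ByType.getOrElse(typ, false)
def lookupParam2(typ: Int): String = param2ByType.getOrElse(typ, s"unknown type $typ")
```

Using getOrElse is one possible design choice; if a missing type should be treated as an error, calling the map directly would throw a NoSuchElementException instead.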

Now the original df is filtered into the different groups, customizedfun is called for each group, and the results are unioned:

//create some test data (toDF and as[...] require import spark.implicits._)
val df = Seq((1, "A", "a"), (1, "B", "b"), (1, "C", "c"), (2, "D", "d"), (2, "E", "e"), (3, "F", "f"))
  .toDF("type", "val1", "val2")
//+----+----+----+
//|type|val1|val2|
//+----+----+----+
//|   1|   A|   a|
//|   1|   B|   b|
//|   1|   C|   c|
//|   2|   D|   d|
//|   2|   E|   e|
//|   3|   F|   f|
//+----+----+----+

//get the distinct values of column type
val distinctTypes = df.select("type").distinct().as[Integer].collect()

//call customizedfun for each group
val resultPerGroup = for (typ <- distinctTypes)
  yield customizedfun(df.filter(s"type = $typ"), calcParam1(typ), calcParam2(typ))

//the final union
val result = resultPerGroup.tail.foldLeft(resultPerGroup.head)(_ union _)

//+----+----+----+---------------+
//|type|val1|val2|         newCol|
//+----+----+----+---------------+
//|   1|   A|   a|type is 1 false|
//|   1|   B|   b|type is 1 false|
//|   1|   C|   c|type is 1 false|
//|   3|   F|   f|type is 3 false|
//|   2|   D|   d| type is 2 true|
//|   2|   E|   e| type is 2 true|
//+----+----+----+---------------+
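The filter, apply and union steps above could also be wrapped in a small reusable helper. The following is only a sketch under a few assumptions: the object PerGroupUnion and the function applyPerGroup are hypothetical names, the grouping column is assumed to hold non-null integers, and the number of distinct groups is assumed to be small enough to collect to the driver. It uses reduceOption so that an empty dataframe yields None rather than an exception, and unionByName (available since Spark 2.3) so that columns are matched by name rather than by position.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object PerGroupUnion {
  // Hypothetical helper: applies f to each group of df (grouped by the
  // integer column groupCol) and unions the per-group results.
  // Returns None when df contains no rows.
  def applyPerGroup(df: DataFrame, groupCol: String)(
      f: (Int, DataFrame) => DataFrame): Option[DataFrame] = {
    // collect() brings the distinct keys to the driver
    val keys = df.select(groupCol).distinct().collect().map(_.getInt(0))
    keys
      .map(key => f(key, df.filter(col(groupCol) === key)))
      .reduceOption(_ unionByName _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("per-group-union")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "A"), (2, "D"), (2, "E")).toDF("type", "val1")

    // The extra parameters are derived from the group key inside the closure
    val result = applyPerGroup(df, "type") { (typ, group) =>
      group.withColumn("newCol", lit(s"type is $typ ${typ % 2 == 0}"))
    }
    result.foreach(_.show())
    spark.stop()
  }
}
```

Each filtered group is evaluated against the original df, so with many groups or an expensive upstream computation it may be worth calling df.cache() before invoking the helper.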


 