
How to repartition a dataframe based on more than one column?

I have a dataframe yearDF with the following columns: name, id_number, location, source_system_name, period_year.

If I want to repartition the dataframe based on a column, I'd do:

yearDF.repartition('source_system_name')

I have a variable: val partition_columns = "source_system_name,period_year"

I tried to do it this way:

val dataDFPart = yearDF.repartition(col(${prtn_String_columns}))

but I get a compilation error: cannot resolve the symbol $

Is there any way I can repartition the dataframe yearDF based on the values in partition_columns?

There are three overloads of the repartition method in Spark's Scala API:

def repartition(partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int): Dataset[T]
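For illustration, each overload might be called like this (a sketch assuming the yearDF from the question; 4 is an arbitrary target partition count):

```scala
import org.apache.spark.sql.functions.col

// Partition by expressions only; Spark chooses the partition count
// (spark.sql.shuffle.partitions by default).
val byExprs = yearDF.repartition(col("source_system_name"), col("period_year"))

// Explicit partition count plus expressions.
val byBoth = yearDF.repartition(4, col("source_system_name"), col("period_year"))

// Explicit partition count only (rows are redistributed round-robin).
val byCount = yearDF.repartition(4)
```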

So to repartition on multiple columns, you can split your string on the comma, map each name to a Column, and expand the resulting array with Scala's vararg operator (: _*), like this:

val columns = partition_columns.split(",").map(x => col(x))
yearDF.repartition(columns: _*)
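The split-and-map step is plain Scala, so it can be checked without a Spark session; trimming guards against stray spaces around the commas (a sketch using the partition_columns value from the question):

```scala
val partition_columns = "source_system_name,period_year"

// Split the comma-separated list and drop any surrounding whitespace.
val names = partition_columns.split(",").map(_.trim)

assert(names.sameElements(Array("source_system_name", "period_year")))
// names.map(col) would then be passed to repartition via `: _*`.
```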

Another way to do it is to pass each column explicitly, one col at a time:

yearDF.repartition(col("source_system_name"), col("period_year"))
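Either form triggers a shuffle. The result can be sanity-checked with getNumPartitions and the built-in spark_partition_id function (a sketch, again assuming yearDF from the question):

```scala
import org.apache.spark.sql.functions.{col, spark_partition_id}

val dataDFPart = yearDF.repartition(col("source_system_name"), col("period_year"))

// Number of partitions after the shuffle.
println(dataDFPart.rdd.getNumPartitions)

// Row count per physical partition.
dataDFPart.groupBy(spark_partition_id()).count().show()
```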
