
How to repartition a dataframe based on more than one column?

I have a dataframe yearDF with the following columns: name, id_number, location, source_system_name, period_year.

If I want to repartition the dataframe based on a column, I'd do:

yearDF.repartition('source_system_name')

I have a variable: val partition_columns = "source_system_name,period_year"

I tried to do it this way:

val dataDFPart = yearDF.repartition(col(${prtn_String_columns}))

but I get a compilation error: cannot resolve the symbol $

Is there any way I can repartition the dataframe yearDF based on the values in partition_columns?

There are three overloads of the repartition method in Spark's Scala API:

def repartition(partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
def repartition(numPartitions: Int): Dataset[T]
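For illustration, each overload might be called like this (a sketch assuming the yearDF from the question; 4 is an arbitrary target partition count):

```scala
import org.apache.spark.sql.functions.col

// Partition by expressions only; Spark chooses the partition count
// (spark.sql.shuffle.partitions by default).
val byExprs = yearDF.repartition(col("source_system_name"), col("period_year"))

// Explicit partition count plus expressions.
val byBoth = yearDF.repartition(4, col("source_system_name"), col("period_year"))

// Explicit partition count only (rows are redistributed round-robin).
val byCount = yearDF.repartition(4)
```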

So to repartition on multiple columns, you can split your string on the comma, map each name to a Column, and expand the resulting array with Scala's vararg operator (: _*), like this:

val columns = partition_columns.split(",").map(x => col(x))
yearDF.repartition(columns: _*)
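The split-and-map step is plain Scala, so it can be checked without a Spark session; trimming guards against stray spaces around the commas (a sketch using the partition_columns value from the question):

```scala
val partition_columns = "source_system_name,period_year"

// Split the comma-separated list and drop any surrounding whitespace.
val names = partition_columns.split(",").map(_.trim)

assert(names.sameElements(Array("source_system_name", "period_year")))
// names.map(col) would then be passed to repartition via `: _*`.
```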

Another way to do it is to pass each column explicitly, one col at a time:

yearDF.repartition(col("source_system_name"), col("period_year"))
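Either form triggers a shuffle. The result can be sanity-checked with getNumPartitions and the built-in spark_partition_id function (a sketch, again assuming yearDF from the question):

```scala
import org.apache.spark.sql.functions.{col, spark_partition_id}

val dataDFPart = yearDF.repartition(col("source_system_name"), col("period_year"))

// Number of partitions after the shuffle.
println(dataDFPart.rdd.getNumPartitions)

// Row count per physical partition.
dataDFPart.groupBy(spark_partition_id()).count().show()
```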
