I have a DataFrame like this:
+----------+----------+----------+----------+----------+----------+----------+
| user_id| apple| orange| banana| pear| table| desk|
+----------+----------+----------+----------+----------+----------+----------+
| 1| 13| null| 55| null| null| null|
| 2| 30| null| null| null| null| null|
| 3| null| null| 50| null| null| null|
| 4| 1| null| 3| null| null| null|
+----------+----------+----------+----------+----------+----------+----------+
I would like to get back an Array[String] containing the names of the fruit columns that hold only null values. The DataFrame is very large, so I don't want to sum every column; I need a faster, more efficient way. I need Scala code.
So i need this list:
List(orange,pear)
This is my current solution, which sums the columns, but I need one that avoids summing every column separately:
val fruitList: Array[String] = Array("apple", "orange", "banana", "pear")
val nullFruits: Array[String] = fruitList.filter(c => dataFrame.agg(sum(c)).first.get(0) == null)
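Before reaching for `describe`, note that the per-column `sum` jobs above can be collapsed into a single aggregation, because Spark's `count` ignores nulls. A sketch, assuming the `df` and `fruitList` shown above:

```scala
import org.apache.spark.sql.functions.{col, count}

// Count the non-null values of every fruit column in one Spark job;
// count(...) skips nulls, so an all-null column comes back as 0.
val countExprs = fruitList.map(c => count(col(c)).alias(c))
val countRow = df.agg(countExprs.head, countExprs.tail: _*).head

// Keep only the columns whose non-null count is zero.
val nullFruits: Array[String] = fruitList.filter(c => countRow.getAs[Long](c) == 0L)
```

This triggers one job over the data instead of one job per fruit column.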
You can also achieve this with Spark's `describe` (or `summary`):
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .summary("count")

// alternatively
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .describe()
  .filter($"summary" === "count")
+-------+-----+------+------+----+
|summary|apple|orange|banana|pear|
+-------+-----+------+------+----+
| count| 3| 0| 3| 0|
+-------+-----+------+------+----+
And to extract the desired values:
r1.columns.tail
  .map(c => (c, r1.select(c).head.getString(0) == "0"))
  .filter(_._2)
  .map(_._1)
which gives:
Array(orange, pear)
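Since `r1` holds a single row, the per-column `select(...).head` actions can be collapsed into one, collecting that row to the driver once and filtering locally. A sketch under the same assumptions (`describe`/`summary` output columns are strings):

```scala
// Pull the single "count" row back to the driver once,
// then keep the column names whose count string is "0".
val countRow = r1.head
val nullFruits: Array[String] = r1.columns.tail
  .filter(c => countRow.getAs[String](c) == "0")
```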