
Scala - What is the most efficient way to get the names of the all-null columns from a Spark DataFrame?

I have a df like this:

+----------+----------+----------+----------+----------+----------+----------+
|   user_id|     apple|    orange|    banana|      pear|     table|      desk|
+----------+----------+----------+----------+----------+----------+----------+
|         1|        13|      null|        55|      null|      null|      null|
|         2|        30|      null|      null|      null|      null|      null|
|         3|      null|      null|        50|      null|      null|      null|
|         4|         1|      null|         3|      null|      null|      null|
+----------+----------+----------+----------+----------+----------+----------+

I would like to get back an Array[String] containing the names of the fruit columns that have only null values. The data frame is very big, so I don't want to sum the columns; I need a faster and more efficient way. I need Scala code.

So I need this list:

List(orange,pear)

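This is my current solution. It sums each column separately, and I need a solution that does not sum all of the columns:

import org.apache.spark.sql.functions.sum

val fruitList:  Array[String] = Array("apple", "orange", "banana", "pear") // the fruit column names
// sum() returns null when a column contains no non-null values; this runs one Spark job per column
val nullFruits: Array[String] = fruitList.filter(col => dataFrame.agg(sum(col)).first.get(0) == null)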

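You can also achieve this with Spark's summary or describe:

// Option 1: summary("count") computes only the non-null count per column (Spark 2.3+)
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .summary("count")

// Option 2 (alternatively): describe() computes several statistics, so keep only the "count" row
// (the $"..." column syntax requires import spark.implicits._)
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .describe()
  .filter($"summary" === "count")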

+-------+-----+------+------+----+
|summary|apple|orange|banana|pear|
+-------+-----+------+------+----+
|  count|    3|     0|     3|   0|
+-------+-----+------+------+----+

And to extract the desired values:

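// Fetch the single "count" row once instead of running one job per column,
// then keep the columns whose count is "0" (summary/describe return strings)
val counts = r1.head
r1.columns.tail
  .filter(c => counts.getAs[String](c) == "0")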

which gives:

Array(orange, pear)
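
The same idea can also be written as a single aggregation, which may be worth trying on a very large data frame: count(col) counts only non-null values, so one select with a count per fruit column returns all the counts in a single job. The following is a minimal sketch, not a benchmarked solution. The helper name allNullColumns, the object name, and the local SparkSession are assumptions made for illustration, and the sample data reproduces only the user_id and fruit columns from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count}

object AllNullColumns {

  // Hypothetical helper (not part of the original answer): returns the
  // candidate columns that contain no non-null value at all.
  def allNullColumns(df: DataFrame, candidates: Seq[String]): Seq[String] =
    if (candidates.isEmpty) Seq.empty
    else {
      // count(col) ignores nulls; every count is computed in one Spark job.
      val counts = df.select(candidates.map(c => count(col(c)).as(c)): _*).head()
      candidates.filter(c => counts.getAs[Long](c) == 0L)
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration purposes only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("all-null-columns")
      .getOrCreate()
    import spark.implicits._

    val n = Option.empty[Int] // stands for a null cell
    val df = Seq(
      (1, Option(13), n, Option(55), n),
      (2, Option(30), n, n,          n),
      (3, n,          n, Option(50), n),
      (4, Option(1),  n, Option(3),  n)
    ).toDF("user_id", "apple", "orange", "banana", "pear")

    val fruitList = Seq("apple", "orange", "banana", "pear")
    val nullFruits = allNullColumns(df, fruitList)

    println(nullFruits.mkString(", ")) // prints: orange, pear

    spark.stop()
  }
}
```

Compared with the extraction step above, this keeps the counts as Long values rather than strings and avoids the extra statistics that describe() computes. It still has to scan every row, so whether it is noticeably faster than summary("count") on a given data set would need to be measured.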
