I have a DataFrame like this:
+----------+----------+----------+----------+----------+----------+----------+
| user_id| apple| orange| banana| pear| table| desk|
+----------+----------+----------+----------+----------+----------+----------+
| 1| 13| null| 55| null| null| null|
| 2| 30| null| null| null| null| null|
| 3| null| null| 50| null| null| null|
| 4| 1| null| 3| null| null| null|
+----------+----------+----------+----------+----------+----------+----------+
I would like to get back an Array[String] containing the names of the fruit columns that hold only null values. The DataFrame is very large, so I don't want to sum every column; I need a faster, more efficient way. I need Scala code.
So i need this list:
List(orange,pear)
This is my current solution, which sums the columns, but I need one that avoids summing every column separately:
val fruitList: Array[String] = Array("apple", "orange", "banana", "pear")
val nullFruits: Array[String] = fruitList.filter(c => dataFrame.agg(sum(c)).first.get(0) == null)
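Before reaching for `describe`, note that the per-column `sum` jobs above can be collapsed into a single aggregation, because Spark's `count` ignores nulls. A sketch, assuming the `df` and `fruitList` shown above:

```scala
import org.apache.spark.sql.functions.{col, count}

// Count the non-null values of every fruit column in one Spark job;
// count(...) skips nulls, so an all-null column comes back as 0.
val countExprs = fruitList.map(c => count(col(c)).alias(c))
val countRow = df.agg(countExprs.head, countExprs.tail: _*).head

// Keep only the columns whose non-null count is zero.
val nullFruits: Array[String] = fruitList.filter(c => countRow.getAs[Long](c) == 0L)
```

This triggers one job over the data instead of one job per fruit column.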
You can also achieve this with Spark's `describe` (or `summary`):
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .summary("count")

// alternatively
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .describe()
  .filter($"summary" === "count")
+-------+-----+------+------+----+
|summary|apple|orange|banana|pear|
+-------+-----+------+------+----+
| count| 3| 0| 3| 0|
+-------+-----+------+------+----+
And to extract the desired values:
r1.columns.tail
  .map(c => (c, r1.select(c).head.getString(0) == "0"))
  .filter(_._2)
  .map(_._1)
which gives:
Array(orange, pear)
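Since `r1` holds a single row, the per-column `select(...).head` actions can be collapsed into one, collecting that row to the driver once and filtering locally. A sketch under the same assumptions (`describe`/`summary` output columns are strings):

```scala
// Pull the single "count" row back to the driver once,
// then keep the column names whose count string is "0".
val countRow = r1.head
val nullFruits: Array[String] = r1.columns.tail
  .filter(c => countRow.getAs[String](c) == "0")
```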