How to get the numeric value of missing values in a PySpark column?

Question

I am working with the OpenFoodFacts dataset using PySpark. Quite a lot of columns consist entirely of missing values, and I want to drop those columns. I have been looking for ways to retrieve the number of missing values in each column, but everything I find displays the result as a table instead of giving me the total number of null values as a plain number.

The following code shows the number of missing values in a column, but displays it in table format:

from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()

I have tried the following:

This one does not work as intended, as it doesn't drop any columns:

for c in data.columns:
    if(data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count()):
        data = data.drop(c)

data.show()

This is the one I am currently trying, but it takes ages to execute:

for c in data.columns:
    if(data.filter(data[c].isNull()).count() == data.count()):
        data = data.drop(c)

data.show()

Is there a way to get ONLY the number? Thanks

Answer 1

If you need the number itself rather than a table, use .collect() instead of .show():

list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()

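What you get back is a list of Row objects containing everything the table showed. Because this query aggregates a single column, the list holds one Row with one field, so list_of_values[0][0] is the count as a plain Python integer.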

solution1
1 ACCEPTED 2022-11-21 10:13:23
