How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. Quite a lot of its columns consist entirely of missing values, and I want to drop those columns. I have been looking for ways to retrieve the number of missing values in each column, but every approach I found displays the result as a table instead of giving me the total number of nulls as a numeric value.

The following code shows the number of missing values in a column, but displays it in table format:

from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()

I have tried the following code:

  • This one does not work as intended: it doesn't drop any columns
for c in data.columns:
    if(data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count()):
        data = data.drop(c)

data.show()
  • This one I am currently trying, but it takes ages to execute
for c in data.columns:
    if(data.filter(data[c].isNull()).count() == data.count()):
        data = data.drop(c)

data.show()

Is there a way to get ONLY the number? Thanks.

If you need the number itself rather than a table, use .collect() instead of .show():

list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()

What you get is a list of Row objects, which contain all the information in the table. Since this query returns a single row with a single column, list_of_values[0][0] is the count as a plain Python integer.
