
PySpark: Get the first non-null value of each column in a DataFrame

I'm dealing with different Spark DataFrames, which have a lot of null values in many columns. I want to get any one non-null value from each column to see whether that value can be converted to a datetime.

I tried df.na.drop().first(), hoping it would drop all rows containing any null value, so that I could then take the first row of the remaining DataFrame, which would have only non-null values. But many of the DataFrames have so many columns with so many null values that df.na.drop() returns an empty DataFrame.

I also tried checking whether any column contains only null values, so that I could drop those columns before trying the approach above, but that still didn't solve the problem. Any idea how I can accomplish this efficiently, given that this code will run many times on huge DataFrames?
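Here is a minimal reproduction of that failure mode, with made-up column names and values: when every row contains at least one null, na.drop() leaves nothing for first() to return.

# Made-up example: every row has at least one null,
# so dropping rows that contain any null leaves nothing behind.
df = spark.createDataFrame(
    [(None, "2021-01-01"), ("a", None)],
    ["x", "ts"],
)

df.na.drop().count()   # 0 -- no row survives
df.na.drop().first()   # None -- there is no first row to inspect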

You can use the first function with ignorenulls. Let's say the data looks like this:

from pyspark.sql.types import StringType, StructType, StructField

schema = StructType([
    StructField("x{}".format(i), StringType(), True) for i in range(3)
])

df = spark.createDataFrame(
    [(None, "foo", "bar"), ("foo", None, "bar"), ("foo", "bar", None)],
    schema
)

You can then select the first non-null value of each column:

from pyspark.sql.functions import first

df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()
Row(x0='foo', x1='foo', x2='bar')
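
One way to follow up on the original goal (checking which values can be parsed as datetimes) is to collect that Row and try parsing each sampled value on the driver. This is only a minimal sketch; the format string is an assumption you would adjust to match your data.

from datetime import datetime

row = df.select([first(x, ignorenulls=True).alias(x) for x in df.columns]).first()

datetime_columns = []
for name, value in row.asDict().items():
    if value is None:
        continue  # the column was entirely null
    try:
        # "%Y-%m-%d %H:%M:%S" is only an assumed format
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        datetime_columns.append(name)
    except (TypeError, ValueError):
        pass

print(datetime_columns)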
