
Spark Dataframe Select Columns After Transformation

I am checking for NULL values in 2 out of the 6 columns in my DataFrame. But when I apply the built-in functions and use select, the resulting DataFrame does not have the remaining columns. Is there a better way to do this without using UDFs?

handle_null_cols = ['col1', 'col3']

# df_null = df.select([myFunc(col_name).alias(col_name) for col_name in df.columns])
df_null = df.select([myFunc(col_name).alias(col_name) for col_name in handle_null_cols])

df_null.printSchema()  # resulting DF has only the 2 selected columns

root
 |-- col1: integer (nullable = true)
 |-- col3: integer (nullable = true)

I need to reuse the same DataFrame df_null for further transformations downstream, with all the columns originally in df.

Why don't you do something like this?

df.select([
    myFunc(col_name).alias(col_name) if col_name in handle_null_cols
    else col_name
    for col_name in df.columns
])
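For context, here is a minimal runnable sketch of that approach. myFunc isn't shown in the question, so the coalesce-to-0 below is just a hypothetical stand-in:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame with six columns; only col1 and col3 need NULL handling
df = spark.createDataFrame(
    [(None, 1, None, "a", 1.0, True), (2, 3, 4, "b", 2.0, False)],
    ["col1", "col2", "col3", "col4", "col5", "col6"],
)

handle_null_cols = ['col1', 'col3']

# Hypothetical stand-in for myFunc: replace NULLs with 0
def myFunc(col_name):
    return F.coalesce(F.col(col_name), F.lit(0))

df_null = df.select([
    myFunc(c).alias(c) if c in handle_null_cols else c
    for c in df.columns
])

df_null.printSchema()  # all six columns are still present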

reduce + withColumn is a more cryptic but viable solution:

from functools import reduce

# Fold over the columns, replacing each one with myFunc applied to it
df_null = reduce(
    lambda acc, col_name: acc.withColumn(col_name, myFunc(col_name)),
    handle_null_cols,
    df)
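One caveat worth noting: each withColumn call introduces another projection in the query plan, so with a long list of columns the single-select version above is usually the lighter option.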

But it sounds a bit like you actually want na functions:

df.na.fill(0, subset=handle_null_cols)
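A quick usage sketch (reusing the toy df and handle_null_cols from the earlier example): na.fill keeps the full schema and only replaces NULLs in the listed columns.

# Replace NULLs with 0 in col1 and col3 only; all other columns pass through untouched
df_filled = df.na.fill(0, subset=handle_null_cols)
df_filled.show()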

I think I figured it out based on @user9613318's insights. Easier on the eye, and performance-efficient as well?

handle_null_cols = ['col1', 'col3']

df_null = df.select(*[myFunc(col).alias(col) if col in handle_null_cols else col
                      for col in df.columns])
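As a quick sanity check (using the hypothetical myFunc from the earlier sketch), the result keeps every column of df, so df_null can be reused for the downstream transformations. Since this is a single select, it should plan as one projection rather than a chain of them, so it should be at least as efficient as the reduce/withColumn variant.

df_null.printSchema()  # shows all original columns, not just col1 and col3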
