I am checking for NULL values in 2 of the 6 columns in my DataFrame. But when I apply the built-in functions through select, the resulting DataFrame does not contain the remaining columns. Is there a better way to do this without using UDFs?
handle_null_cols = [ 'col1', 'col3' ]
# df_null = df.select([ myFunc(col_name).alias(col_name) for col_name in df.columns ])
df_null = df.select( [ myFunc(col_name).alias(col_name) for col_name in handle_null_cols ])
df_null.printSchema() # Resultant DF has only 2 columns selected
col1:int
col3:int
I need to reuse the same DataFrame, df_null, for further transformations downstream, with all the columns originally in df.
Why not do something like this?
df.select([
    myFunc(col_name).alias(col_name) if col_name in handle_null_cols
    else col_name
    for col_name in df.columns
])
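The question never shows `myFunc`, so here is a minimal sketch of what it might look like using built-in functions only. It assumes the goal is to replace NULLs with a default value. The names `column_plan`, `replace_null` and `apply_to_subset` are hypothetical helpers made up for illustration, not part of PySpark.

```python
def column_plan(all_cols, target_cols):
    """Pair every column name with a flag saying whether it needs the transform.

    Iterating over all_cols (not target_cols) is what keeps the untouched
    columns, and their original order, in the result.
    """
    targets = set(target_cols)
    return [(name, name in targets) for name in all_cols]


def replace_null(col_name, default=0):
    """Hypothetical stand-in for myFunc: NULL becomes `default`, built-ins only."""
    # Imported lazily so column_plan can be used without a Spark installation.
    from pyspark.sql import functions as F
    return F.coalesce(F.col(col_name), F.lit(default))


def apply_to_subset(df, target_cols, func=replace_null):
    """Apply func to target_cols and pass every other column through unchanged."""
    from pyspark.sql import functions as F
    return df.select([
        func(name).alias(name) if needs_fix else F.col(name)
        for name, needs_fix in column_plan(df.columns, target_cols)
    ])
```

With a live session, `apply_to_subset(df, handle_null_cols).printSchema()` should list all of the original columns rather than just the two that were transformed.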
reduce + withColumn is a more cryptic but viable solution:
from functools import reduce

reduce(
    lambda df, col_name: df.withColumn(col_name, myFunc(col_name)),
    handle_null_cols,
    df)
But it sounds a bit like you actually want the na functions:
df.na.fill(0, subset=handle_null_cols)
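`fill` also accepts a dict mapping column names to replacement values, which may help when the columns need different defaults. A small sketch follows; `fill_map` and `fill_nulls` are hypothetical helpers, and the default values are made up for illustration.

```python
def fill_map(cols, default=0, overrides=None):
    """Build the {column: value} dict that df.na.fill accepts."""
    mapping = {name: default for name in cols}
    mapping.update(overrides or {})
    return mapping


def fill_nulls(df, cols, default=0, overrides=None):
    """Replace NULLs in the listed columns; all other columns are kept as they are."""
    # When a dict is passed, df.na.fill ignores `subset` and uses the dict keys.
    return df.na.fill(fill_map(cols, default, overrides))
```

For example, `fill_nulls(df, handle_null_cols, overrides={'col3': -1})` would fill col1 with 0 and col3 with -1 while leaving the other columns in place.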
I think I figured it out based on @user9613318's insights. This is easier on the eye. Is it performance-efficient as well?
handle_null_cols = [ 'col1', 'col3' ]
df_null = df.select(*[
    myFunc(col).alias(col) if col in handle_null_cols else col
    for col in df.columns
])