
Spark Dataframe Select Columns After Transformation

I am checking for NULL values in 2 out of the 6 columns in my DataFrame. But when I apply the built-in functions and use select, the resulting DataFrame does not have the remaining columns. Is there a better way to do this without using UDFs?

handle_null_cols = ['col1', 'col3']

# df_null = df.select([myFunc(col_name).alias(col_name) for col_name in df.columns])
df_null = df.select([myFunc(col_name).alias(col_name) for col_name in handle_null_cols])

df_null.printSchema()  # resulting DF has only the 2 selected columns

root
 |-- col1: integer (nullable = true)
 |-- col3: integer (nullable = true)

I need to reuse the same DataFrame df_null for further transformations downstream, with all the columns originally in df.

Why don't you do something like this?

df.select([
    myFunc(col_name).alias(col_name) if col_name in handle_null_cols
    else col_name
    for col_name in df.columns
])
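For context, here is a minimal runnable sketch of that approach. myFunc isn't shown in the question, so the coalesce-to-0 below is just a hypothetical stand-in:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame with six columns; only col1 and col3 need NULL handling
df = spark.createDataFrame(
    [(None, 1, None, "a", 1.0, True), (2, 3, 4, "b", 2.0, False)],
    ["col1", "col2", "col3", "col4", "col5", "col6"],
)

handle_null_cols = ['col1', 'col3']

# Hypothetical stand-in for myFunc: replace NULLs with 0
def myFunc(col_name):
    return F.coalesce(F.col(col_name), F.lit(0))

df_null = df.select([
    myFunc(c).alias(c) if c in handle_null_cols else c
    for c in df.columns
])

df_null.printSchema()  # all six columns are still present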

reduce + withColumn is a more cryptic but viable solution:

from functools import reduce

# Fold over the columns, replacing each one with myFunc applied to it
df_null = reduce(
    lambda acc, col_name: acc.withColumn(col_name, myFunc(col_name)),
    handle_null_cols,
    df)
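One caveat worth noting: each withColumn call introduces another projection in the query plan, so with a long list of columns the single-select version above is usually the lighter option.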

But it sounds a bit like you actually want na functions:

df.na.fill(0, subset=handle_null_cols)
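A quick usage sketch (reusing the toy df and handle_null_cols from the earlier example): na.fill keeps the full schema and only replaces NULLs in the listed columns.

# Replace NULLs with 0 in col1 and col3 only; all other columns pass through untouched
df_filled = df.na.fill(0, subset=handle_null_cols)
df_filled.show()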

I think I figured it out based on @user9613318's insights. Easier on the eye, and performance-efficient as well?

handle_null_cols = ['col1', 'col3']

df_null = df.select(*[myFunc(col).alias(col) if col in handle_null_cols else col
                      for col in df.columns])
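As a quick sanity check (using the hypothetical myFunc from the earlier sketch), the result keeps every column of df, so df_null can be reused for the downstream transformations. Since this is a single select, it should plan as one projection rather than a chain of them, so it should be at least as efficient as the reduce/withColumn variant.

df_null.printSchema()  # shows all original columns, not just col1 and col3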
