
Pyspark: Prevent Column value from changing once calculated

I have a Spark dataframe which I am using to store keys generated for encryption. I use a UDF to generate a key for each row, but whenever the dataframe is queried the keys change. I have tried wrapping the function in lit(), but it still changes. How can I make this column immutable?

Code:

from cryptography.fernet import Fernet
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

def generate_key():
    encryptionKey = Fernet.generate_key().decode('utf-8')
    return encryptionKey

get_key = udf(generate_key, StringType())

def key_table(df):
    df = df.select("id").withColumn('encKey', lit(get_key()))
    return df

Output: https://i.stack.imgur.com/8zXcv.png

UDFs in Spark are rerun any time the data is evicted from memory and Spark has to recompute the column. If your UDF is deterministic (i.e. it always returns the same output for the same input), this causes no problem. However, in your case the UDF is non-deterministic, so each recomputation produces different keys. One way to overcome this is to checkpoint the dataframe and use the checkpointed result from then on.

