I have a Spark dataframe that stores keys generated for encryption. I use a UDF to generate a key for each row, but the keys change whenever the dataframe is queried. I have tried wrapping the function in lit(), but the keys still change. How can I make this column immutable?
Code:
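from cryptography.fernet import Fernet
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType

def generate_key():
    # Returns a new random Fernet key on every call
    encryptionKey = Fernet.generate_key().decode('utf-8')
    return encryptionKey

get_key = udf(generate_key, StringType())

def key_table(df):
    df = df.select("id").withColumn('encKey', lit(get_key()))
    return df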
UDFs in Spark are re-run whenever Spark has to recompute the column, for example when the data has been evicted from memory or was never persisted and a new query re-evaluates the plan. If your UDF is deterministic (i.e., it always returns the same output for the same input), this causes no problem. In your case, however, the UDF is non-deterministic, so every recomputation produces new keys. One way to overcome this is to checkpoint the dataframe and use the checkpointed result from then on.