
Pyspark: Prevent Column value from changing once calculated

I have a Spark dataframe which I am using to store keys generated for encryption. I use a UDF to generate a key for each row, but whenever the dataframe is queried the keys change. I have tried wrapping the function in lit(), but it still changes. How can I make this column immutable?

Code:

from cryptography.fernet import Fernet
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

def generate_key():
    encryptionKey = Fernet.generate_key().decode('utf-8')
    return encryptionKey

get_key = udf(generate_key, StringType())

def key_table(df):
    df = df.select("id").withColumn('encKey', lit(get_key()))
    return df

Output: https://i.stack.imgur.com/8zXcv.png

UDFs in Spark are rerun any time the data is evicted from memory and Spark has to recompute the column. If your UDF is deterministic (i.e. it always returns the same output for the same input), this causes no problem. However, in your case the UDF is non-deterministic, so each recomputation produces different keys. One way to overcome this is to checkpoint the dataframe and use the checkpointed result from then on.

