PySpark: prevent column values from changing after they are computed

Question

I have a Spark DataFrame that I use to store keys generated for encryption. I use a UDF to generate a key for each row, but whenever the DataFrame is queried the keys change. I have tried wrapping the function in lit(), but the values still change. How can I make this column immutable?

Code:

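from cryptography.fernet import Fernet
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

def generate_key():
    # Generate a new random Fernet key and return it as a string
    encryptionKey = Fernet.generate_key().decode('utf-8')
    return encryptionKey

get_key = udf(generate_key, StringType())

def key_table(df):
    # Keep only the id column and add a key column produced by the UDF
    df = df.select("id").withColumn('encKey', lit(get_key()))
    return df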

Output: https://i.stack.imgur.com/8zXcv.png

Answer 1

UDFs in Spark are rerun whenever the data is evicted from memory and Spark has to recompute the column. If your UDF is deterministic (i.e., it always returns the same output for the same input), this causes no problem. In your case, however, the UDF is non-deterministic. One way to overcome this problem is to checkpoint the DataFrame and then use the checkpointed result in further processing.
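
To make the recomputation behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not Spark code: LazyColumn and MaterializedColumn are hypothetical toy classes, and generate_key uses the standard library as a stand-in for Fernet.generate_key() (32 random bytes, URL-safe base64 encoded). The lazy object stores only the recipe, so every "action" produces new keys; the checkpointed object stores the values, so they stay fixed.

```python
import base64
import os


def generate_key() -> str:
    # Stand-in for Fernet.generate_key(): 32 random bytes, URL-safe base64.
    return base64.urlsafe_b64encode(os.urandom(32)).decode("utf-8")


class MaterializedColumn:
    """Toy model of a checkpointed column: the values are stored, not the recipe."""

    def __init__(self, values):
        self._values = list(values)

    def collect(self):
        # Return a copy of the stored values; nothing is recomputed.
        return list(self._values)


class LazyColumn:
    """Toy model of a lazily evaluated column: only the recipe is stored."""

    def __init__(self, fn, n_rows):
        self.fn = fn
        self.n_rows = n_rows

    def collect(self):
        # Every action re-runs the recipe, so a non-deterministic
        # function yields different values each time.
        return [self.fn() for _ in range(self.n_rows)]

    def checkpoint(self):
        # Run the recipe exactly once and freeze the result.
        return MaterializedColumn(self.collect())


lazy = LazyColumn(generate_key, 3)
first, second = lazy.collect(), lazy.collect()   # two "queries", two sets of keys

frozen = lazy.checkpoint()
stable_a, stable_b = frozen.collect(), frozen.collect()  # same keys both times
```

In PySpark itself, the corresponding step would be roughly to call spark.sparkContext.setCheckpointDir(...) with a directory of your choice and then use key_table(df).checkpoint() in place of the lazy DataFrame; writing the DataFrame to storage and reading it back should have a similar effect. Note also that lit() applied to an existing Column returns that column unchanged, which would explain why wrapping the UDF call in lit() does not freeze the keys.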

Answer 1 timestamp: 2022-09-16 16:20:36
