
Pyspark - DataFrame not updated when applying functions in a loop

I'm trying to apply different functions to various columns of a DataFrame depending on a condition. When I do this in a loop, fn1 is applied successfully on the first iteration, but df becomes None on the second iteration. I guess the problem is the way I'm initializing df in the scope of the loop.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(10,4,2,3),(20,1,3,4),(30,7,4,5),(40,2,1,9)], schema=['id','metric_1','metric_2', 'metric_3'])

cols_info = [{'name':'metric_1','apply_func':'True','method':'fn1'},{'name':'metric_2','apply_func':'True','method':'fn2'}, {'name':'metric_3','apply_func':'True','method':'fn3'}]

def fn1(df, col):
    return df.withColumn(col, F.pow(df[col], 2))

def fn2(df, col):
    return df.withColumn(col, F.hash(df[col]))

def fn3(df, col):
    return df.withColumn(col, F.log2(df[col]))

def process_data(df, columns):
    for col in columns:
        if col["apply_func"] == "True":
            if column["method"] == "fn1":
                df = fn1(df, col["name"])
            if column["method"] == "fn2":
                df = fn2(df, col["name"])
            if column["method"] == "fn3":
                df = fn3(df, col["name"])

    return df

What is the correct way to apply such transformations with the PySpark DataFrame API?

Can you try writing the functions this way? This worked for me:

def fn1(df, col):
    df = df.withColumn(col, F.pow(df[col], 2))
    return df


def fn2(df, col):
    df = df.withColumn(col, F.hash(df[col]))
    return df

def fn3(df, col):
    df = df.withColumn(col, F.log2(df[col]))
    return df

def process_data(df, columns):
    for col in columns:
        if col["apply_func"] == "True":
            if col["method"] == "fn1":
                df = fn1(df, col["name"])
            if col["method"] == "fn2":
                df = fn2(df, col["name"])
            if col["method"] == "fn3":
                df = fn3(df, col["name"])
    return df

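I thought the assignment inside each function might be necessary, but I'm not sure: withColumn returns a new DataFrame either way, so returning df.withColumn(...) directly should be equivalent. Note that the loop above reads col["method"] throughout, whereas the loop in the question reads column["method"], which is not the loop variable. Someone could improve on my answer.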
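
The chain of if checks in process_data can also be expressed as a dictionary lookup. The following is only a sketch, not the asker's or answerer's code: build_methods, the methods parameter, and the ValueError for unknown names are assumptions made for illustration. PySpark is imported inside build_methods so that the dispatch loop itself does not depend on it.

```python
from typing import Any, Callable, Dict, List

# A transform takes (DataFrame, column name) and returns a new DataFrame.
Transform = Callable[[Any, str], Any]


def build_methods() -> Dict[str, Transform]:
    """Hypothetical registry mirroring fn1/fn2/fn3; imports PySpark lazily."""
    from pyspark.sql import functions as F

    return {
        "fn1": lambda df, c: df.withColumn(c, F.pow(df[c], 2)),
        "fn2": lambda df, c: df.withColumn(c, F.hash(df[c])),
        "fn3": lambda df, c: df.withColumn(c, F.log2(df[c])),
    }


def process_data(df: Any, columns: List[dict], methods: Dict[str, Transform]) -> Any:
    """Apply the configured transform to each flagged column, in order."""
    for col in columns:
        # The flag is the string "True" in cols_info, not a boolean.
        if col["apply_func"] != "True":
            continue
        fn = methods.get(col["method"])
        if fn is None:
            # Assumption: fail loudly instead of silently skipping unknown names.
            raise ValueError(f"unknown method: {col['method']!r}")
        # Each transform returns a new DataFrame, so rebind df every iteration.
        df = fn(df, col["name"])
    return df
```

With the df and cols_info from the question, this would be called as process_data(df, cols_info, build_methods()). A function that forgets its return statement would hand None back to the loop, which is one way df could end up as None on the next iteration.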
