
Use udf inside for loop to create multiple columns in PySpark

Here is the original dataframe:

And the desired dataframe:

I have a Spark dataframe with some columns (col1, col2, col3, col4, col5, ... up to col32). I have created a function (udf) which takes 2 input parameters and returns some float value.

Now I want to create new columns (in increasing order, like col33, col34, col35, ...) using the above function, with one parameter increasing and the other parameter held constant:

def fun(col1, col2):
    if some_condition:
        # do something
        ...
    else:
        # do something else
        ...

I have converted this function to a udf:

udf_func = udf(fun, FloatType())

Now I want to use this function to create new columns in the dataframe. How do I do that?

This is what I tried:

for i in range(1, 5):
    BS.withColumn("some_name with increasing number like abc_1, abc_2",
                  udf_func(col1, col6))  # first argument should increase: col1, col2, ... up to col4; col6 is fixed

How can I achieve this in PySpark?

You can only create one column at a time using withColumn, so we'll have to call it several times.

# We set up the problem
columns = ["col1", "col2", "col3"]
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)

df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   1|   2|   3|
#|   4|   5|   6|
#|   7|   8|   9|
#+----+----+----+

Since your logic is based on an if-else condition, you can implement it within each iteration using when and otherwise. Since I don't know your use case, I check a trivial condition: if colX is even, we add col3 to it; if it is odd, we subtract col3.

We create a new column on each iteration by taking the number at the end of the column name and adding the number of columns (in our case 3), generating col4, col5, col6.

# You'll need a function to extract the number at the end of the column name
import re
def get_trailing_number(s):
  m = re.search(r'\d+$', s)
  return int(m.group()) if m else None
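As a quick sanity check (plain Python, no Spark required; the helper definition is repeated so the snippet stands alone), the helper behaves like this:

```python
import re

def get_trailing_number(s):
    # Extract the integer at the end of a column name, e.g. "col12" -> 12
    m = re.search(r'\d+$', s)
    return int(m.group()) if m else None

print(get_trailing_number("col12"))  # -> 12
print(get_trailing_number("name"))   # -> None (no trailing digits)
```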

from pyspark.sql.functions import col, when
from pyspark.sql.types import FloatType
rich_df = df
for i in df.columns:
    rich_df = rich_df.withColumn(f'col{get_trailing_number(i) + 3}',
                                 when(col(i) % 2 == 0, col(i) + col("col3"))
                                 .otherwise(col(i) - col("col3")).cast(FloatType()))

rich_df.show()
#+----+----+----+----+----+----+
#|col1|col2|col3|col4|col5|col6|
#+----+----+----+----+----+----+
#|   1|   2|   3|-2.0| 5.0| 0.0|
#|   4|   5|   6|10.0|-1.0|12.0|
#|   7|   8|   9|-2.0|17.0| 0.0|
#+----+----+----+----+----+----+

Here's a UDF version of the same function:

def func(col, constant):
  if (col % 2 == 0):
    return float(col + constant)
  else:
    return float(col - constant)

from pyspark.sql.functions import udf

func_udf = udf(func, FloatType())
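Because the branching lives in plain Python, you can sanity-check `func` itself without starting Spark (the definition is repeated so the snippet stands alone):

```python
def func(col, constant):
    # Even values: add the constant; odd values: subtract it.
    if col % 2 == 0:
        return float(col + constant)
    else:
        return float(col - constant)

print(func(2, 3))  # -> 5.0
print(func(1, 3))  # -> -2.0
```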

rich_df = df
for i in df.columns:
    rich_df = rich_df.withColumn(f'col{get_trailing_number(i) + 3}',
                                 func_udf(col(i), col("col3")))

rich_df.show()
#+----+----+----+----+----+----+
#|col1|col2|col3|col4|col5|col6|
#+----+----+----+----+----+----+
#|   1|   2|   3|-2.0| 5.0| 0.0|
#|   4|   5|   6|10.0|-1.0|12.0|
#|   7|   8|   9|-2.0|17.0| 0.0|
#+----+----+----+----+----+----+

It's hard to say more without understanding what you're trying to do.
