use udf inside for loop to create multiple columns in Pyspark
I have a Spark DataFrame with some columns (col1, col2, col3, col4, col5, ... up to col32). I have created a function (UDF) that takes 2 input parameters and returns a float value.
Now I want to create new columns (in increasing order, like col33, col34, col35, ...) using the above function, with one parameter increasing and the other parameter held constant.
def fun(col1, col2):
    if condition:
        # do something
    else:
        # do something else
I have converted this function to a UDF:
udf_func = udf(fun, FloatType())
Now I want to use this function to create new columns in the DataFrame. How do I do that?
I tried:
for i in range(1, 5):
    BS.withColumns("some_name with increasing number like abc_1, abc_2",
                   udf_func(col1,   # <- this should be col1, col2, ... up to col4
                            col6))  # <- this is fixed
How to achieve this in PySpark?
You can only create one column at a time using withColumn, so we'll have to call it several times.
# We set up the problem
columns = ["col1", "col2", "col3"]
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| 1| 2| 3|
#| 4| 5| 6|
#| 7| 8| 9|
#+----+----+----+
Since your condition is an if-else, you can implement the logic within each iteration using when and otherwise. Since I don't know your use case, I check a trivial condition: if colX is even, we add col3 to it; if it is odd, we subtract col3.
We create a new column each iteration based on the number at the end of the column name, plus the number of columns (in our case 3), to generate col4, col5, col6.
# You'll need a function to extract the number at the end of the column name
import re

def get_trailing_number(s):
    m = re.search(r'\d+$', s)
    return int(m.group()) if m else None

from pyspark.sql.functions import col, when
from pyspark.sql.types import FloatType

rich_df = df
for i in df.columns:
    rich_df = rich_df.withColumn(f'col{get_trailing_number(i) + 3}',
                                 when(col(i) % 2 == 0, col(i) + col("col3"))
                                 .otherwise(col(i) - col("col3")).cast(FloatType()))
rich_df.show()
rich_df.show()
#+----+----+----+----+----+----+
#|col1|col2|col3|col4|col5|col6|
#+----+----+----+----+----+----+
#| 1| 2| 3|-2.0| 5.0| 0.0|
#| 4| 5| 6|10.0|-1.0|12.0|
#| 7| 8| 9|-2.0|17.0| 0.0|
#+----+----+----+----+----+----+
Here's a UDF version of the function:
from pyspark.sql.functions import udf

def func(col, constant):
    if col % 2 == 0:
        return float(col + constant)
    else:
        return float(col - constant)

func_udf = udf(lambda col, constant: func(col, constant), FloatType())

rich_df = df
for i in df.columns:
    rich_df = rich_df.withColumn(f'col{get_trailing_number(i) + 3}',
                                 func_udf(col(i), col("col3")))
rich_df.show()
#+----+----+----+----+----+----+
#|col1|col2|col3|col4|col5|col6|
#+----+----+----+----+----+----+
#| 1| 2| 3|-2.0| 5.0| 0.0|
#| 4| 5| 6|10.0|-1.0|12.0|
#| 7| 8| 9|-2.0|17.0| 0.0|
#+----+----+----+----+----+----+
It's hard to say more without understanding what you're trying to do.