Pass list to udf in dataframe with Column
I am building a dataframe from a Hive table, and I need to transform a column based on multiple columns in the dataframe. For that I built a udf and passed kwargs, but I suspect the order of the kwargs may change, and the order is important. So I decided to use a list instead, and I am exploring how to pass multiple columns as a list in a dataframe transformation.
function:
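from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def func(values):
    # Concatenate the items of the list in order
    val = ''
    for i in values:
        val = val + i
    return val

# Attempt: pass several columns to the udf as a single list
df = df.withColumn(new_col, func(df["col1"], df["col2"], df["col3"]))
df.show()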
The dynamic column approach below might solve your problem.
from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
'''
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+
'''
col_list = ['col1','col2']
df = df.withColumn('concatenated_cols2',concat(*col_list))
col_list = ['col1','col2','col3']
df = df.withColumn('concatenated_cols3',concat(*col_list))
col_list = ['col1','col2','col3','col4']
df = df.withColumn('concatenated_cols4',concat(*col_list))
df.show()
'''
+----+----+----+----+------------------+------------------+------------------+
|col1|col2|col3|col4|concatenated_cols2|concatenated_cols3|concatenated_cols4|
+----+----+----+----+------------------+------------------+------------------+
|  A1|  11|  A3|  A4|              A111|            A111A3|          A111A3A4|
|  B1|  22|  B3|  B4|              B122|            B122B3|          B122B3B4|
|  C1|  33|  C3|  C4|              C133|            C133C3|          C133C3C4|
+----+----+----+----+------------------+------------------+------------------+
'''
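If the transformation cannot be expressed with built-in functions and a udf is still needed, one option is to pack the columns with array(), so the udf receives them as a single list in the order they were passed. The following is only a minimal sketch under some assumptions: every column can be cast to string, and the names concat_values and add_concat_column are hypothetical helpers, not part of the original post.

```python
def concat_values(values):
    """Join the items of a list in order, skipping None entries."""
    return "".join(str(v) for v in values if v is not None)


def add_concat_column(df, cols, new_col="new_col"):
    """Add `new_col` to a Spark DataFrame by applying concat_values to `cols`.

    Hypothetical helper: pyspark is imported here so that concat_values
    itself can be tried out without a Spark installation.
    """
    from pyspark.sql.functions import array, col, udf
    from pyspark.sql.types import StringType

    concat_udf = udf(concat_values, StringType())
    # array() keeps the argument order; the cast gives all elements one type
    packed = array(*[col(c).cast("string") for c in cols])
    return df.withColumn(new_col, concat_udf(packed))
```

With the example dataframe above, add_concat_column(df, ['col1', 'col2', 'col3']) would be called in the same place as the concat(*col_list) lines. Because the list is built from col_list, the order of the values inside the udf follows the order of col_list.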
Thanks Smart_Coder, and sorry for the delay in getting back to you. Let me give you the full requirement. I will take the dataframe you mentioned above as an example, and I will take 3 columns as input (it should be dynamic, but I will use these for now). col1, col2 and col3 are the input columns to the function. Column values should move from right to left in case of null or empty values.
Extension of the requirement: I then need to check the count of characters in each value and take only a specific number of characters into that column. The remainder should go into the next column, and if that is still more than the specific number of characters, the rest goes into the column after that. However, we need only 3 columns/elements as output.
col1 col2 col3
ASDF QWER NMVB
QWER NMVB
ASD NMVB
Suppose I need at most 3 characters in each field.
The output will be:
col1 col2 col3
ASD F QWE
QWE R NMV
ASD NMV
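One possible reading of this requirement can be sketched in plain Python and then wrapped in a udf. The sketch assumes that null and empty-string values are simply skipped (which shifts later values to the left), that each remaining value is cut into pieces of at most `width` characters, and that only the first 3 pieces are kept. The names shift_and_split and add_split_columns are hypothetical. Under this reading the third example row gives ASD, NMV, B, whereas the expected output above lists only ASD and NMV, so the handling of that trailing remainder may need adjusting.

```python
def shift_and_split(values, width=3, n_out=3):
    """Skip null/empty values, cut the rest into pieces of at most `width`
    characters, and return exactly `n_out` pieces (padded with None)."""
    pieces = []
    for v in values:
        if v is None or v == "":
            continue  # skipping the gap is what moves later values left
        pieces.extend(v[i:i + width] for i in range(0, len(v), width))
    pieces = pieces[:n_out]  # keep only the first n_out pieces
    return pieces + [None] * (n_out - len(pieces))


def add_split_columns(df, cols, width=3):
    """Overwrite `cols` of a Spark DataFrame with the shifted, split values.

    Hypothetical helper: pyspark is imported here so that shift_and_split
    can be tried out without a Spark installation.
    """
    from pyspark.sql.functions import array, col, udf
    from pyspark.sql.types import ArrayType, StringType

    split_udf = udf(
        lambda vs: shift_and_split(vs, width, len(cols)),
        ArrayType(StringType()),
    )
    packed = array(*[col(c).cast("string") for c in cols])
    df = df.withColumn("_pieces", split_udf(packed))
    for i, c in enumerate(cols):
        df = df.withColumn(c, col("_pieces")[i])  # i-th piece -> i-th column
    return df.drop("_pieces")
```

Since empty values are dropped before splitting, the result does not depend on which column held the empty value, only on the order of the non-empty ones.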