Pyspark: How to apply a user-defined function with a row of a data frame as the argument?
I have a PySpark dataframe with 87 columns. I want to pass each row of the dataframe to a function and get a list back for each row, so that I can create a column separately.
    def make_range_vector(row, categories, ledger):
        print(type(row), type(categories), type(ledger))
        category_vector = []
        for category in categories:
            if row[category] != 0:
                category_percentage = func.round(row[category] * 100 / row[ledger])
                category_vector.append(category_percentage)
            else:
                category_vector.append(0)
        category_vector = sqlCtx.createDataFrame(category_vector, IntegerType())
        return category_vector

    pivot_card.withColumn('category_debit_vector',
                          make_range_vector(struct([pivot_card[x] for x in pivot_card.columns]),
                                            pivot_card.columns[3:], 'debit'))
I am a beginner in PySpark, and I can't find answers to the questions below.
The statement if row[category] != 0: gives me:

    ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
So I printed the types of the arguments inside the function. It outputs:

    <class 'pyspark.sql.column.Column'> <class 'list'> <class 'str'>

Shouldn't it be StructType?
Can I pass a Row object and do something similar, like we do in Pandas?
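Yes, in a sense: the closest analogue to Pandas' row-wise apply is to drop down to the RDD API, where each element really is a pyspark.sql.Row. The sketch below uses a plain dict to stand in for a Row (a Row supports the same row[key] lookup); the Spark call is shown only as a hedged comment, and the DataFrame name pivot_card is an assumption from the question.

```python
# Sketch of the Pandas-like row-wise pattern. A pyspark.sql.Row supports
# row[key] lookup, so a plain dict stands in for it here.
def row_percentages(row, categories, ledger):
    """Return the rounded percentage of each category column over the ledger column."""
    return [round(row[c] * 100 / row[ledger]) if row[c] != 0 else 0
            for c in categories]

# With Spark, the row-wise version would look like this (sketch, not run here;
# assumes a DataFrame named pivot_card as in the question):
# vectors = pivot_card.rdd.map(
#     lambda r: row_percentages(r, pivot_card.columns[3:], 'debit')).collect()

print(row_percentages({'groceries': 50, 'fuel': 0, 'debit': 200},
                      ['groceries', 'fuel'], 'debit'))  # -> [25, 0]
```

Note that rdd.map gives you real Row objects, but the result is an RDD, not a new column; to add a column you would still need a UDF or a join back on an id.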
I looked at many sources; the code is mostly taken from this question and this source ( https://community.hortonworks.com/questions/130866/rowwise-manipulation-of-a-dataframe-in-pyspark.html ).
I found the silly mistake I made in the code. Instead of calling the UDF, I called the original function. I have corrected it below:
    pivot_card.withColumn('category_debit_vector',
                          make_range_vector_udf(struct([pivot_card[x] for x in pivot_card.columns]),
                                                pivot_card.columns[3:], 'debit'))
I have understood that we cannot really pass non-Column arguments to a UDF. Thanks.
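One common workaround for the "no extra arguments" limitation is to capture the extra Python values in a closure when the UDF is built, so that only the struct() Column crosses the UDF boundary. Below is a sketch: the row logic is plain Python (a dict stands in for a pyspark.sql.Row), and the Spark wiring is shown as hedged comments since it assumes the question's pivot_card DataFrame.

```python
# Plain Python logic operating on one row. Inside a real UDF, `row` would be
# a pyspark.sql.Row, which supports the same row[key] lookup as a dict.
def make_range_vector(row, categories, ledger):
    vector = []
    for category in categories:
        if row[category] != 0:
            # Use Python's round(), not pyspark.sql.functions.round():
            # inside a UDF you work with plain Python values, not Columns.
            vector.append(round(row[category] * 100 / row[ledger]))
        else:
            vector.append(0)
    return vector

# Non-Column arguments (categories, ledger) are captured in the closure when
# the UDF is built, so the UDF itself takes a single Column (sketch; assumes
# a DataFrame named pivot_card):
# from pyspark.sql.functions import udf, struct
# from pyspark.sql.types import ArrayType, IntegerType
# categories = pivot_card.columns[3:]
# make_range_vector_udf = udf(lambda row: make_range_vector(row, categories, 'debit'),
#                             ArrayType(IntegerType()))
# pivot_card = pivot_card.withColumn(
#     'category_debit_vector',
#     make_range_vector_udf(struct(*pivot_card.columns)))

print(make_range_vector({'rent': 120, 'food': 0, 'debit': 300},
                        ['rent', 'food'], 'debit'))  # -> [40, 0]
```

The UDF should also declare a return type (here ArrayType(IntegerType()) for a list of ints) rather than build a DataFrame with sqlCtx.createDataFrame inside the function, which is not allowed on executors.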