
Pyspark: How to apply a user defined function with row of a data frame as the argument?

I have a PySpark dataframe with 87 columns. I want to pass each row of the dataframe to a function and get a list for each row, so that I can create a column separately.

PySpark code

UDF:

def make_range_vector(row, categories, ledger):
    print(type(row), type(categories), type(ledger))
    category_vector = []
    for category in categories:
        if row[category] != 0:
            category_percentage = func.round(row[category] * 100 / row[ledger])
            category_vector.append(category_percentage)
        else:
            category_vector.append(0)
    category_vector = sqlCtx.createDataFrame(category_vector, IntegerType())
    return category_vector

Main function

pivot_card.withColumn('category_debit_vector', make_range_vector(struct([pivot_card[x] for x in pivot_card.columns]), pivot_card.columns[3:], 'debit'))

I am a beginner in PySpark, and I can't find answers to the questions below.

  1. if(row[category]!=0): This statement gives me ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

  2. So, I printed the types of the arguments inside the function. It outputs <class 'pyspark.sql.column.Column'> <class 'list'> <class 'str'>. Shouldn't it be StructType?

  3. Can I pass a Row object and do something similar, like we do in Pandas?

I looked at many sources, and this is mostly taken from this question and this source ( https://community.hortonworks.com/questions/130866/rowwise-manipulation-of-a-dataframe-in-pyspark.html ).

PySpark row-wise function composition

I found the silly mistake I made in the code: instead of calling the UDF, I called the original function. I have corrected it below:

Main function

pivot_card.withColumn('category_debit_vector', make_range_vector_udf(struct([pivot_card[x] for x in pivot_card.columns]), pivot_card.columns[3:], 'debit'))
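
Note that make_range_vector_udf is never defined in the post. Presumably it wraps the original function with pyspark.sql.functions.udf; a minimal sketch of what that declaration might look like (the ArrayType(IntegerType()) return type is an assumption based on the vector of rounded percentages):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    # Hypothetical wrapper: exposes make_range_vector as a UDF returning
    # a list of integers (the return type is an assumption).
    make_range_vector_udf = udf(make_range_vector, ArrayType(IntegerType()))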

EDIT

I have understood that we cannot really pass other arguments to a UDF. Thanks.
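
Since the extra Python arguments (categories and ledger) cannot be passed through the UDF call itself, a common workaround is to capture them in a closure so the UDF only receives the row struct at runtime. A minimal sketch under that assumption; it also uses the built-in round and returns a plain list, because a UDF body works on ordinary Python values and must not create a DataFrame:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    def make_range_vector_maker(categories, ledger):
        # categories and ledger are captured by the closure, so the
        # UDF itself only receives the row struct at runtime.
        def make_range_vector(row):
            category_vector = []
            for category in categories:
                if row[category] != 0:
                    # Values inside a UDF are plain Python numbers,
                    # so use the built-in round, not func.round.
                    category_vector.append(int(round(row[category] * 100 / row[ledger])))
                else:
                    category_vector.append(0)
            return category_vector  # a plain list, not a DataFrame
        return F.udf(make_range_vector, ArrayType(IntegerType()))

    range_vector_udf = make_range_vector_maker(pivot_card.columns[3:], 'debit')
    pivot_card = pivot_card.withColumn(
        'category_debit_vector',
        range_vector_udf(F.struct(*[pivot_card[c] for c in pivot_card.columns]))
    )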

