
Row-wise operation using a function in PySpark

I have a function like this:

  def number(row):
      if row['temp'] == '1 Person':
          return 'One'
      elif row['temp'] == '2 Persons':
          return 'Two'
      elif row['temp'] == '3 Persons':
          return 'Three'
      elif row['temp'] in ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
                           '8 Persons', '9 Persons', '10 Persons', '11 Persons']:
          return 'More'
      else:
          return None
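Note that this function indexes into the row with `row['temp']`, so standalone it can be exercised with any mapping that has a `'temp'` key. As a quick sanity check of the branching logic (restating the function above verbatim):

```python
def number(row):
    # Same logic as the question's function: map a passenger-count
    # label in row['temp'] to a short word.
    if row['temp'] == '1 Person':
        return 'One'
    elif row['temp'] == '2 Persons':
        return 'Two'
    elif row['temp'] == '3 Persons':
        return 'Three'
    elif row['temp'] in ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
                         '8 Persons', '9 Persons', '10 Persons', '11 Persons']:
        return 'More'
    else:
        return None

# Plain dicts work here as well as Spark Row objects,
# since both support ['temp'] access.
print(number({'temp': '1 Person'}))   # One
print(number({'temp': '7 Persons'}))  # More
print(number({'temp': 'unknown'}))    # None
```

This call convention matters below: the accepted answer's UDF version receives the column *value* directly, not a row, so its function drops the `['temp']` indexing.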

Now, I want to change the values in my data frame by looping through it row-wise.

How can I loop through my data frame and replace the values according to the function above in PySpark?

  1. Create a sample data frame with the data

     df_row = spark.createDataFrame(
         [("1 Person", "2", "3"), ("2 Persons", "2", "3"), ("3 Persons", "2", "3"),
          ("4 Persons", "2", "3"), ("5 Persons", "2", "3"), ("6 Persons", "2", "3"),
          ("7 Persons", "2", "3"), ("8 Persons", "2", "3"), ("9 Persons", "2", "3")],
         schema=['temp', 'col2', 'col3'])
  2. Define the function and create a UDF from it

     from pyspark.sql.functions import udf, col
     from pyspark.sql.types import StringType

     def number(row):
         if row == '1 Person':
             return 'One'
         elif row == '2 Persons':
             return 'Two'
         elif row == '3 Persons':
             return 'Three'
         elif row in ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
                      '8 Persons', '9 Persons', '10 Persons', '11 Persons']:
             return 'More'
         else:
             return None

     numberUDF = udf(lambda z: number(z), StringType())
  3. Rewrite the 'temp' column using the UDF

     df_row = df_row.withColumn('temp',numberUDF(col('temp')))
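The if/elif chain inside the UDF can also be collapsed into a dictionary lookup. This is a sketch of an equivalent plain-Python mapping function (the `LABELS` table is my own name, not from the answer); it could be wrapped with `udf(number, StringType())` in exactly the same way as above:

```python
# Hypothetical lookup table replacing the if/elif chain.
LABELS = {'1 Person': 'One', '2 Persons': 'Two', '3 Persons': 'Three'}
LABELS.update({f'{n} Persons': 'More' for n in range(4, 12)})

def number(value):
    # dict.get returns None for unmatched labels, matching the
    # else-branch of the original function.
    return LABELS.get(value)

# In Spark this would be registered like the answer's UDF:
# numberUDF = udf(number, StringType())
```

The lookup-table form keeps the label-to-word mapping in one place, which makes it easier to extend than a growing elif chain.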

After this transformation the `temp` column holds One, Two, Three, and More in place of the original labels (the original answer showed a screenshot of `df_row.show()` here).

The Python function can be translated, almost identically, into a native SQL expression:

from pyspark.sql import functions as F

df_row.withColumn('result',
    F.expr("""case
                when temp == '1 Person' then 'One'
                when temp == '2 Persons' then 'Two'
                when temp == '3 Persons' then 'Three'                
                when temp in ('4 Persons', '5 Persons', '6 Persons', '7 Persons','8 Persons','9 Persons','10 Persons','11 Persons') then 'More'
                else null 
              end""")).show()

While a UDF-based solution offers more flexibility, the SQL expression performs better because Spark does not have to execute any Python code.

Note: Technical posts on this site follow the CC BY-SA 4.0 license; please credit this site or the original source when reposting. For any questions, contact yoyou2525@163.com.

© 2020-2024 STACKOOM.COM (粤ICP备18138465号)