
Row-wise operation using a function in PySpark

I have a function like this:

  def number(row):
      if row['temp'] == '1 Person':
          return 'One'
      elif row['temp'] == '2 Persons':
          return 'Two'
      elif row['temp'] == '3 Persons':
          return 'Three'
      elif row['temp'] in ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
                           '8 Persons', '9 Persons', '10 Persons', '11 Persons']:
          return 'More'
      else:
          return None
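Note that this function indexes into the row with `row['temp']`, so standalone it can be exercised with any mapping that has a `'temp'` key. As a quick sanity check of the branching logic (restating the function above verbatim):

```python
def number(row):
    # Same logic as the question's function: map a passenger-count
    # label in row['temp'] to a short word.
    if row['temp'] == '1 Person':
        return 'One'
    elif row['temp'] == '2 Persons':
        return 'Two'
    elif row['temp'] == '3 Persons':
        return 'Three'
    elif row['temp'] in ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
                         '8 Persons', '9 Persons', '10 Persons', '11 Persons']:
        return 'More'
    else:
        return None

# Plain dicts work here as well as Spark Row objects,
# since both support ['temp'] access.
print(number({'temp': '1 Person'}))   # One
print(number({'temp': '7 Persons'}))  # More
print(number({'temp': 'unknown'}))    # None
```

This call convention matters below: the accepted answer's UDF version receives the column *value* directly, not a row, so its function drops the `['temp']` indexing.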

Now, I want to change the values in my data frame by looping through it row-wise.

How can I loop through my data frame and replace the values according to the function above in PySpark?

  1. Create a sample data frame with the data

     df_row = spark.createDataFrame(
         [("1 Person", "2", "3"), ("2 Persons", "2", "3"), ("3 Persons", "2", "3"),
          ("4 Persons", "2", "3"), ("5 Persons", "2", "3"), ("6 Persons", "2", "3"),
          ("7 Persons", "2", "3"), ("8 Persons", "2", "3"), ("9 Persons", "2", "3")],
         schema=['temp', 'col2', 'col3'])
  2. Define the function and create a UDF from it

     from pyspark.sql.functions import udf, col
     from pyspark.sql.types import StringType

     def number(row):
         if row == '1 Person':
             return 'One'
         elif row == '2 Persons':
             return 'Two'
         elif row == '3 Persons':
             return 'Three'
         elif row in ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
                      '8 Persons', '9 Persons', '10 Persons', '11 Persons']:
             return 'More'
         else:
             return None

     numberUDF = udf(lambda z: number(z), StringType())
  3. Rewrite the 'temp' column using the UDF

     df_row = df_row.withColumn('temp',numberUDF(col('temp')))
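The if/elif chain inside the UDF can also be collapsed into a dictionary lookup. This is a sketch of an equivalent plain-Python mapping function (the `LABELS` table is my own name, not from the answer); it could be wrapped with `udf(number, StringType())` in exactly the same way as above:

```python
# Hypothetical lookup table replacing the if/elif chain.
LABELS = {'1 Person': 'One', '2 Persons': 'Two', '3 Persons': 'Three'}
LABELS.update({f'{n} Persons': 'More' for n in range(4, 12)})

def number(value):
    # dict.get returns None for unmatched labels, matching the
    # else-branch of the original function.
    return LABELS.get(value)

# In Spark this would be registered like the answer's UDF:
# numberUDF = udf(number, StringType())
```

The lookup-table form keeps the label-to-word mapping in one place, which makes it easier to extend than a growing elif chain.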

After this transformation the `temp` column holds One, Two, Three, and More in place of the original labels (the original answer showed a screenshot of `df_row.show()` here).

The Python function can be translated, almost identically, into a native SQL expression:

from pyspark.sql import functions as F

df_row.withColumn('result',
    F.expr("""case
                when temp == '1 Person' then 'One'
                when temp == '2 Persons' then 'Two'
                when temp == '3 Persons' then 'Three'                
                when temp in ('4 Persons', '5 Persons', '6 Persons', '7 Persons','8 Persons','9 Persons','10 Persons','11 Persons') then 'More'
                else null 
              end""")).show()

While a UDF-based solution offers more flexibility, the SQL expression performs better because Spark does not have to execute any Python code.

Note: Technical posts on this site follow the CC BY-SA 4.0 license; please credit this site or the original source when reposting. For any questions, contact yoyou2525@163.com.

© 2020-2024 STACKOOM.COM (粤ICP备18138465号)