
How to iterate over a pyspark.sql.Column?

I have a PySpark DataFrame, and I want to get a specific column and iterate over its values. For example:

userId    itemId
1         2
2         2
3         7
4         10

I can get the userId column with df.userId, and for each userId in this column I want to apply a method. How can I achieve this?

Your question does not specify what kind of function you want to apply, so I have created an example that adds an item description based on the value of itemId.

First let's import the relevant libraries and create the data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Reuse the active SparkSession, or create one if running as a standalone script.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2), (2, 2), (3, 7), (4, 10)], ['userId', 'itemId'])

Secondly, create the function and convert it into a UDF that PySpark can use:

def item_description(itemId):
    # Lookup table mapping an itemId to its product name.
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    # Use .get so an unknown itemId returns None (null in the DataFrame)
    # instead of raising a KeyError, which would fail the whole Spark task.
    return items.get(itemId)

item_description_udf = udf(item_description, StringType())
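As an aside, if you are on Spark 3.x with pandas and pyarrow installed, the same lookup can be written as a pandas UDF, which processes values in vectorized batches and typically outperforms a row-at-a-time UDF. A minimal sketch of that alternative (the name item_description_pudf is just for illustration):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Vectorized variant: receives and returns a whole pandas Series per batch.
@pandas_udf(StringType())
def item_description_pudf(item_ids: pd.Series) -> pd.Series:
    items = {2: "iPhone 8", 7: "Apple iMac", 10: "iPad"}
    # Series.map looks up every id in the batch at once; unknown ids become null.
    return item_ids.map(items)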

Finally, add a new ItemDescription column and populate it with the value returned by the item_description_udf function:

df = df.withColumn("ItemDescription", item_description_udf(df.itemId))
df.show()

This gives the following output:

+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
|     1|     2|       iPhone 8|
|     2|     2|       iPhone 8|
|     3|     7|     Apple iMac|
|     4|    10|           iPad|
+------+------+---------------+
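If you really do need to iterate over the column's values on the driver, rather than transform them with a UDF as above, one option is to collect them into a plain Python list. Note that collect() pulls every value into driver memory, so this only suits small results. A minimal sketch, using the df from above:

# Pull the userId column back to the driver as a list of Python values.
user_ids = [row.userId for row in df.select("userId").collect()]

for user_id in user_ids:
    # Apply any ordinary Python function to each value here.
    print(user_id)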
