
How to iterate over a pyspark.sql.Column?

I have a PySpark DataFrame, and I want to get a specific column and iterate over its values. For example:

userId    itemId
1         2
2         2
3         7
4         10

I can get the userId column with df.userId, and for each userId in this column I want to apply a method. How can I achieve this?

Your question does not specify what kind of function you want to apply, so I have created an example that adds an item description based on the value of itemId.

First let's import the relevant libraries and create the data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Reuse the active SparkSession, or create one if running as a standalone script.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2), (2, 2), (3, 7), (4, 10)], ['userId', 'itemId'])

Secondly, create the function and convert it into a UDF that PySpark can use:

def item_description(itemId):
    # Lookup table mapping an itemId to its product name.
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    # Use .get so an unknown itemId returns None (null in the DataFrame)
    # instead of raising a KeyError, which would fail the whole Spark task.
    return items.get(itemId)

item_description_udf = udf(item_description, StringType())
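As an aside, if you are on Spark 3.x with pandas and pyarrow installed, the same lookup can be written as a pandas UDF, which processes values in vectorized batches and typically outperforms a row-at-a-time UDF. A minimal sketch of that alternative (the name item_description_pudf is just for illustration):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Vectorized variant: receives and returns a whole pandas Series per batch.
@pandas_udf(StringType())
def item_description_pudf(item_ids: pd.Series) -> pd.Series:
    items = {2: "iPhone 8", 7: "Apple iMac", 10: "iPad"}
    # Series.map looks up every id in the batch at once; unknown ids become null.
    return item_ids.map(items)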

Finally, add a new ItemDescription column and populate it with the value returned by the item_description_udf function:

df = df.withColumn("ItemDescription", item_description_udf(df.itemId))
df.show()

This gives the following output:

+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
|     1|     2|       iPhone 8|
|     2|     2|       iPhone 8|
|     3|     7|     Apple iMac|
|     4|    10|           iPad|
+------+------+---------------+
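If you really do need to iterate over the column's values on the driver, rather than transform them with a UDF as above, one option is to collect them into a plain Python list. Note that collect() pulls every value into driver memory, so this only suits small results. A minimal sketch, using the df from above:

# Pull the userId column back to the driver as a list of Python values.
user_ids = [row.userId for row in df.select("userId").collect()]

for user_id in user_ids:
    # Apply any ordinary Python function to each value here.
    print(user_id)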
