I have a pyspark DataFrame and I want to get a specific column and iterate over its values. For example:
userId  itemId
1       2
2       2
3       7
4       10
I get the userId column with df.userId, and for each userId in this column I want to apply a method. How can I achieve this?
Your question is not very specific about the type of function you want to apply, so I have created an example that adds an item description based on the value of itemId.
First let's import the relevant libraries and create the data:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1,2),(2,2),(3,7),(4,10)], ['userId', 'itemId'])
Secondly, create the function and convert it into a UDF that can be used by PySpark:
def item_description(itemId):
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    return items[itemId]

item_description_udf = udf(item_description, StringType())
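One caveat: items[itemId] raises a KeyError for any itemId that is not in the dict, which would make the UDF fail at runtime. A minimal sketch of a safer variant, using dict.get with a default (the function name and fallback string are my own, not from the original answer):

```python
def item_description_safe(itemId):
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    # dict.get returns the fallback instead of raising KeyError
    return items.get(itemId, "unknown item")

# Wrap it exactly like the original, e.g.:
# item_description_safe_udf = udf(item_description_safe, StringType())
print(item_description_safe(2))    # iPhone 8
print(item_description_safe(99))   # unknown item
```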
Finally, add a new column ItemDescription and populate it with the value returned by the item_description_udf function:
df = df.withColumn("ItemDescription", item_description_udf(df.itemId))
df.show()
This gives the following output:
+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
| 1| 2| iPhone 8|
| 2| 2| iPhone 8|
| 3| 7| Apple iMac|
| 4| 10| iPad|
+------+------+---------------+