How to iterate over a pyspark.sql.Column?
I have a PySpark DataFrame and I want to get a specific column and iterate over its values. For example:
userId itemId
1 2
2 2
3 7
4 10
I get the userId column by df.userId, and for each userId in this column I want to apply a method. How can I achieve this?
Your question is not very specific about the type of function you want to apply, so I have created an example that adds an item description based on the value of itemId.
First, let's import the relevant libraries and create the data:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1, 2), (2, 2), (3, 7), (4, 10)], ['userId', 'itemId'])
Secondly, create the function and convert it into a UDF that can be used by PySpark:
def item_description(itemId):
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    return items[itemId]

item_description_udf = udf(item_description, StringType())
Finally, add a new ItemDescription column and populate it with the value returned by the item_description_udf function:
df = df.withColumn("ItemDescription",item_description_udf(df.itemId))
df.show()
This gives the following output:
+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
| 1| 2| iPhone 8|
| 2| 2| iPhone 8|
| 3| 7| Apple iMac|
| 4| 10| iPad|
+------+------+---------------+