How to iterate over a pyspark.sql.Column?
I have a PySpark DataFrame and I want to get a specific column and iterate over its values. For example:
userId itemId
1 2
2 2
3 7
4 10
I get the userId column by df.userId, and for each userId in this column I want to apply a method. How can I achieve this?
Your question is not very specific about the type of function you want to apply, so I have created an example that adds an item description based on the value of itemId.
First, let's import the relevant libraries and create the data:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1, 2), (2, 2), (3, 7), (4, 10)], ['userId', 'itemId'])
Secondly, create the function and convert it into a UDF that can be used by PySpark:
def item_description(itemId):
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    return items[itemId]

item_description_udf = udf(item_description, StringType())
Finally, add a new ItemDescription column and populate it with the value returned by the item_description_udf function:
df = df.withColumn("ItemDescription",item_description_udf(df.itemId))
df.show()
This gives the following output:
+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
| 1| 2| iPhone 8|
| 2| 2| iPhone 8|
| 3| 7| Apple iMac|
| 4| 10| iPad|
+------+------+---------------+