
How to iterate over a pyspark.sql.Column?

I have a PySpark DataFrame and I want to get a specific column and iterate over its values. For example:

userId    itemId
1         2
2         2
3         7
4         10

I get the userId column via df.userId, and for each userId in this column I want to apply a method. How can I achieve this?

Your question is not very specific about the type of function you want to apply, so I have created an example that adds an item description based on the value of itemId.

First, let's import the relevant libraries and create the data:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = spark.createDataFrame([(1,2),(2,2),(3,7),(4,10)], ['userId', 'itemId'])

Secondly, create the function and convert it into a UDF that PySpark can use:

def item_description(itemId):
    items = {2  : "iPhone 8",
             7  : "Apple iMac",
             10 : "iPad"}
    return items[itemId]

item_description_udf = udf(item_description, StringType())
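Because the UDF just wraps a plain Python function, you can sanity-check the lookup logic locally before applying it on the cluster. One caveat worth testing: `items[itemId]` raises a KeyError inside the executor for any itemId missing from the dict. A small variant using `dict.get` (a robustness sketch, not part of the original answer) returns None instead, which Spark stores as null:

```python
def item_description(itemId):
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    # .get returns None (null in Spark) for unknown ids instead of raising KeyError
    return items.get(itemId)

print(item_description(7))   # Apple iMac
print(item_description(99))  # None
```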

Finally, add a new column ItemDescription and populate it with the value returned by the item_description_udf function:

df = df.withColumn("ItemDescription", item_description_udf(df.itemId))
df.show()

This gives the following output:

+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
|     1|     2|       iPhone 8|
|     2|     2|       iPhone 8|
|     3|     7|     Apple iMac|
|     4|    10|           iPad|
+------+------+---------------+
