

Hi, I am trying to iterate over a PySpark data frame without using spark_df.collect()

Hi, I am trying to iterate over a PySpark data frame without using spark_df.collect(), and I have tried the foreach and map methods. Is there any other way to iterate?

```
df.foreach(lambda x: print(x))
```
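(Note that foreach runs on the executors, so on a cluster the print output ends up in the executor logs rather than on the driver console.)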

```
def func1(x):
    firstName = x.firstname
    lastName = x.lastName
    name = firstName + "," + lastName
    gender = x.gender.lower()
    salary = x.salary * 2
    return (name, gender, salary)
```

```
rdd2 = df.rdd.map(lambda x: func1(x))
```
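To inspect the mapped result, the RDD of tuples can be turned back into a DataFrame (a small sketch; the column names here are just illustrative):

```
df2 = rdd2.toDF(["name", "gender", "salary"])
df2.show(truncate=False)
```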

Is there any other way to iterate over a data frame?

First of all, Spark is not made for this kind of operation, like printing each record; it is built for distributed processing. Tune your process to work in a distributed fashion, for example in terms of joins - that will unleash the power of Spark.

If you want to process each record, UDFs (user-defined functions) are a good way to do that. A UDF is applied to each record exactly once.
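For instance, here is a minimal sketch of that approach, assuming a DataFrame df with the columns used in the question (firstname, lastName, gender, salary); the UDF name make_name is just illustrative:

```
from pyspark.sql.functions import udf, col, lower
from pyspark.sql.types import StringType

# UDF that builds the combined name; Spark invokes it once per record.
@udf(returnType=StringType())
def make_name(first, last):
    return first + "," + last

df2 = (df
       .withColumn("name", make_name(col("firstname"), col("lastName")))
       .withColumn("gender", lower(col("gender")))   # built-in function, no UDF needed
       .withColumn("salary", col("salary") * 2))
df2.show()
```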

We can use this method to iterate over rows:

```
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    print(row['itm_mtl_no'], row['itm_src_sys_cd'])
```
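Keep in mind that df.toPandas() brings the whole DataFrame into driver memory, just like collect(), so this approach only works when the data fits on a single machine.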

