Hi, I am trying to iterate over a PySpark DataFrame without using spark_df.collect(). I have tried the foreach and map methods:

```
df.foreach(lambda x: print(x))
```

and
```
def func1(x):
    # Build a derived record from each Row
    firstName = x.firstname
    lastName = x.lastName
    name = firstName + "," + lastName
    gender = x.gender.lower()
    salary = x.salary * 2
    return (name, gender, salary)

rdd2 = df.rdd.map(lambda x: func1(x))
```
Is there any other way to iterate over the DataFrame?
First of all, Spark is not made for operations like printing each record; it is designed for distributed processing. Tune your process to work in a distributed fashion, for example in terms of joins - that will unleash the power of Spark.

If you want to process each record, UDFs (user-defined functions) are a good way to do that. A UDF is applied to each record once.
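For example, the per-record logic of func1 from the question could be expressed with a UDF plus built-in column functions. This is a minimal sketch, assuming df has the same columns (firstname, lastName, gender, salary) as in the question:

```
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# UDF that runs once per record to build the combined name
@F.udf(returnType=StringType())
def full_name(first, last):
    return first + "," + last

result = (df.withColumn("name", full_name(df.firstname, df.lastName))
            .withColumn("gender", F.lower(df.gender))
            .withColumn("salary", df.salary * 2))
```

Note that the lower-casing and the salary arithmetic stay as built-in column expressions, which Spark can optimize; the UDF is only needed for the custom string logic.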
We can use this method to iterate over rows:

```
# Bring the whole DataFrame to the driver as a pandas DataFrame,
# then iterate over its rows with iterrows()
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    print(row['itm_mtl_no'], row['itm_src_sys_cd'])
```
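Be aware that toPandas() pulls the entire DataFrame into driver memory, much like collect(). If that is the concern, a sketch of an alternative is df.toLocalIterator(), which fetches one partition at a time (column names reused from the example above):

```
# Stream Row objects to the driver one partition at a time,
# without materializing the whole DataFrame in memory
for row in df.toLocalIterator():
    print(row['itm_mtl_no'], row['itm_src_sys_cd'])
```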