

Hi, I am trying to iterate over a PySpark data frame without using spark_df.collect()

Hi, I am trying to iterate over a PySpark data frame without using spark_df.collect(), and I have tried the foreach and map methods. Is there any other way to iterate?

```
df.foreach(lambda x: print(x))
```
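(Note that foreach runs on the executors, so on a cluster the print output ends up in the executor logs rather than on the driver console.)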

```
def func1(x):
    firstName = x.firstname
    lastName = x.lastName
    name = firstName + "," + lastName
    gender = x.gender.lower()
    salary = x.salary * 2
    return (name, gender, salary)
```

```
rdd2 = df.rdd.map(lambda x: func1(x))
```
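To inspect the mapped result, the RDD of tuples can be turned back into a DataFrame (a small sketch; the column names here are just illustrative):

```
df2 = rdd2.toDF(["name", "gender", "salary"])
df2.show(truncate=False)
```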

Is there any other way to iterate over a data frame?

First of all, Spark is not made for this kind of operation, like printing each record; it is built for distributed processing. Tune your process to work in a distributed fashion, for example in terms of joins - that will unleash the power of Spark.

If you want to process each record, UDFs (user-defined functions) are a good way to do that. A UDF is applied to each record exactly once.
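For instance, here is a minimal sketch of that approach, assuming a DataFrame df with the columns used in the question (firstname, lastName, gender, salary); the UDF name make_name is just illustrative:

```
from pyspark.sql.functions import udf, col, lower
from pyspark.sql.types import StringType

# UDF that builds the combined name; Spark invokes it once per record.
@udf(returnType=StringType())
def make_name(first, last):
    return first + "," + last

df2 = (df
       .withColumn("name", make_name(col("firstname"), col("lastName")))
       .withColumn("gender", lower(col("gender")))   # built-in function, no UDF needed
       .withColumn("salary", col("salary") * 2))
df2.show()
```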

We can use this method to iterate over rows:

```
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    print(row['itm_mtl_no'], row['itm_src_sys_cd'])
```
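Keep in mind that df.toPandas() brings the whole DataFrame into driver memory, just like collect(), so this approach only works when the data fits on a single machine.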

