pyspark using one task for mapPartitions when converting rdd to dataframe
I'm confused as to why Spark appears to use only 1 task for rdd.mapPartitions when converting the resulting RDD to a DataFrame.
This is an issue for me because I would like to go from:

DataFrame
--> RDD
--> rdd.mapPartitions
--> DataFrame
so that I can read in data (DataFrame), apply a non-SQL function to chunks of data (mapPartitions on an RDD), and then convert back to a DataFrame so that I can use the DataFrame.write process.
I am able to go from DataFrame --> mapPartitions and then use an RDD writer like saveAsTextFile, but that is less than ideal since the DataFrame.write process can do things like overwrite and save data in Orc format. So I'd like to learn why this is going on, but from a practical perspective I'm primarily concerned with being able to just go from a DataFrame --> mapPartitions --> to using the DataFrame.write process.
Here is a reproducible example. The following works as expected, with 100 tasks for the mapPartitions work:
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession \
    .builder \
    .master("yarn-client") \
    .enableHiveSupport() \
    .getOrCreate()
sc = spark.sparkContext

df = pd.DataFrame({'var1': range(100000), 'var2': [x - 1000 for x in range(100000)]})
spark_df = spark.createDataFrame(df).repartition(100)

def f(part):
    return [(1, 2)]

spark_df.rdd.mapPartitions(f).collect()
However, if the last line is changed to something like spark_df.rdd.mapPartitions(f).toDF().show(), then there will only be one task for the mapPartitions work.
DataFrame.show() only shows the first rows of your dataframe, by default only the first 20. If that number is smaller than the number of rows per partition, Spark is lazy and only evaluates a single partition, which is equivalent to a single task.
You can also do collect on a dataframe, to compute and collect all partitions and see 100 tasks again.
You will still see the runJob task first as before, which is caused by the toDF call needing to determine the resulting dataframe's schema: it has to process a single partition to determine the output types of your mapping function. After this initial stage the actual action, such as collect, will happen on all partitions.
之类的实际行动将在所有分区上发生。 For instance, for me running your snippet with the last line replaced with spark_df.rdd.mapPartitions(f).toDF().collect()
results in these stages: 例如,对于我运行你的代码片段,最后一行替换为
spark_df.rdd.mapPartitions(f).toDF().collect()
产生以下阶段: