
pyspark using one task for mapPartitions when converting rdd to dataframe

I'm confused as to why it appears that Spark is using 1 task for rdd.mapPartitions when converting the resulting RDD to a DataFrame.

This is an issue for me because I would like to go from:

DataFrame --> RDD --> rdd.mapPartitions --> DataFrame

so that I can read in data (a DataFrame), apply a non-SQL function to chunks of data (mapPartitions on the RDD), and then convert back to a DataFrame so that I can use the DataFrame.write process.

I am able to go from DataFrame --> mapPartitions and then use an RDD writer like saveAsTextFile, but that is less than ideal since the DataFrame.write process can do things like overwrite existing output and save data in ORC format. So I'd like to learn why this is going on, but from a practical perspective I'm primarily concerned with being able to go from a DataFrame --> mapPartitions --> to using the DataFrame.write process.
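For illustration, here is a minimal sketch of the kind of pipeline I'm after. It assumes the spark session and spark_df from the reproducible example below, plus a hypothetical per-partition function process_partition and a hypothetical output path; it is not the code that exhibits the problem, just the target shape of the workflow:

from pyspark.sql.types import StructType, StructField, LongType

# Explicit schema for the rows produced by the per-partition function
out_schema = StructType([
    StructField("var1", LongType(), True),
    StructField("var2", LongType(), True),
])

def process_partition(rows):
    # Placeholder for the non-SQL, chunk-wise logic applied to each partition
    for row in rows:
        yield (row.var1, row.var2 + 1)

result_df = spark.createDataFrame(
    spark_df.rdd.mapPartitions(process_partition), schema=out_schema)

# DataFrame.write can overwrite existing output and save in ORC format
result_df.write.mode("overwrite").orc("/tmp/example_orc_output")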

Here is a reproducible example. The following works as expected, with 100 tasks for the mapPartitions work:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession \
    .builder \
    .master("yarn-client") \
    .enableHiveSupport() \
    .getOrCreate()

sc = spark.sparkContext

# 100,000-row pandas DataFrame converted to a Spark DataFrame with 100 partitions
df = pd.DataFrame({'var1': range(100000), 'var2': [x - 1000 for x in range(100000)]})
spark_df = spark.createDataFrame(df).repartition(100)

# Dummy per-partition function: emits a single (1, 2) tuple per partition
def f(part):
    return [(1, 2)]

spark_df.rdd.mapPartitions(f).collect()

However, if the last line is changed to something like spark_df.rdd.mapPartitions(f).toDF().show(), then there is only one task for the mapPartitions work.

Some screenshots illustrating this are below:

[Spark UI screenshots showing the number of tasks in each case]

DataFrame.show() only shows the first rows of your dataframe, by default only the first 20. If that number is smaller than the number of rows per partition, Spark is lazy and only evaluates a single partition, which is equivalent to a single task.

You can also call collect on the dataframe to compute and collect all partitions, and you will see 100 tasks again.

You will still see the runJob task first, as before; it is caused by the toDF call, which needs to process a single partition in order to determine the output types of your mapping function and hence the resulting dataframe's schema. After this initial stage, the actual action such as collect will happen on all partitions. For instance, running your snippet with the last line replaced with spark_df.rdd.mapPartitions(f).toDF().collect() results in these stages:

[Spark UI screenshot of the resulting stages]
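As a side note, if you want to skip that initial schema-inference job entirely, you should be able to pass an explicit schema so that there is nothing left to infer from the data. This is an assumption on my part rather than something shown in the screenshots; a minimal sketch with hypothetical column names a and b:

from pyspark.sql.types import StructType, StructField, LongType

# Hypothetical column names/types matching the (1, 2) tuples produced by f
explicit_schema = StructType([
    StructField("a", LongType(), True),
    StructField("b", LongType(), True),
])

# With a full StructType supplied, createDataFrame should not need to run a
# separate job over one partition just to infer the output types
spark.createDataFrame(spark_df.rdd.mapPartitions(f), schema=explicit_schema).collect()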
