
Number of Partitions of Spark Dataframe

Can anyone explain how the number of partitions is determined when a Spark DataFrame is created?

I know that for an RDD we can specify the number of partitions at creation time, like below:

val RDD1 = sc.textFile("path", 6)

But when creating a Spark DataFrame, there does not seem to be an option to specify the number of partitions the way there is for an RDD.

The only possibility I can think of is to use the repartition API after the DataFrame has been created:

df.repartition(4)
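For reference, this is how I currently check the partition count after repartitioning (a minimal sketch; the CSV path is a placeholder and a SparkSession `spark` is assumed, e.g. from spark-shell):

val df = spark.read.csv("path")              // "path" is a placeholder
val repartitioned = df.repartition(4)
println(repartitioned.rdd.getNumPartitions)  // 4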

So, can anyone let me know whether we can specify the number of partitions while creating a DataFrame?

You cannot, or at least not in the general case, but it is not that different compared to an RDD. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
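To illustrate (a minimal sketch; the file path is a placeholder), the second argument to textFile is minPartitions, so the actual partition count can end up higher:

val rdd = sc.textFile("path", 6)   // 6 is minPartitions, a lower bound
println(rdd.getNumPartitions)      // >= 6, depending on input splits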

In general:

  • Datasets generated locally, using methods like range or toDF on a local collection, use spark.default.parallelism.
  • Datasets created from an RDD inherit the number of partitions from their parent.
  • Datasets created using the Data Source API: the partitioning depends on the source. Some data sources provide additional options which give more control over partitioning; for example, the JDBC source allows you to set a partitioning column, a value range, and the desired number of partitions (see the sketch after this list).
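A minimal sketch illustrating these cases (the JDBC URL and table name are hypothetical, and a SparkSession is assumed):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitions").getOrCreate()
import spark.implicits._

// Locally generated Dataset: partition count follows spark.default.parallelism
val local = spark.range(0, 1000)
println(local.rdd.getNumPartitions)

// Dataset created from an RDD inherits the parent's partition count
val rdd = spark.sparkContext.parallelize(1 to 100, 6)
println(rdd.toDF("n").rdd.getNumPartitions)    // 6, inherited from the RDD

// JDBC source: partitioning column, value range and desired partition count
val jdbc = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host/db")  // hypothetical URL
  .option("dbtable", "events")                 // hypothetical table
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
println(jdbc.rdd.getNumPartitions)             // 8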

By default, a Spark DataFrame uses 200 shuffle partitions (spark.sql.shuffle.partitions), while the default number of partitions for an RDD is governed by spark.default.parallelism.
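A minimal sketch of the shuffle-partition default, assuming a SparkSession `spark` with implicits imported (note that on newer Spark versions adaptive query execution may coalesce the final count):

// Any shuffle (groupBy, join, ...) produces spark.sql.shuffle.partitions partitions
val counts = spark.range(0, 1000).toDF("id")
  .groupBy($"id" % 10)
  .count()
println(counts.rdd.getNumPartitions)   // 200 unless overridden

// The default can be changed at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "50")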
