
Spark: Difference between repartition and spark.sql.shuffle.partitions

I am running a Spark program with --conf spark.sql.shuffle.partitions=100

Inside the application I have the following:

// Repartition by user id, then sort each partition by timestamp
Dataset<Row> df_partitioned = df.repartition(df.col("enriched_usr_id"));
df_partitioned = df_partitioned.sortWithinPartitions(df_partitioned.col("transaction_ts"));
// Apply the per-user logic partition by partition
Dataset<Transformed> result = df_partitioned.mapPartitions(
    SparkFunctionImpl.mapExecuteUserLogic(), Encoders.bean(Transformed.class));

I have around 5 million users, and I want to sort the data for every user and execute some logic per user.

My question is: does this partition the data into 5 million partitions or 100 partitions, and how does the execution work per user?

df.repartition(df.col("enriched_usr_id")) will partition the data into 100 partitions (spark.sql.shuffle.partitions), which means multiple users will end up in the same partition.
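The effect can be illustrated with a plain-Java sketch of hash partitioning. Spark actually hashes the column value with Murmur3; here `Long.hashCode` is used as a stand-in (an assumption for illustration only), but the modulo-by-numPartitions behaviour is the same: every user id maps to one of only 100 buckets, so many users share a partition.

```java
public class PartitionSketch {
    // Stand-in for Spark's hash partitioner. Spark uses Murmur3 on the
    // column value; Long.hashCode is an illustrative substitute, but either
    // way the key space collapses into numPartitions buckets.
    static int partitionFor(long userId, int numPartitions) {
        return Math.floorMod(Long.hashCode(userId), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 100; // spark.sql.shuffle.partitions
        int[] usersPerPartition = new int[numPartitions];
        for (long userId = 1; userId <= 5_000_000L; userId++) {
            usersPerPartition[partitionFor(userId, numPartitions)]++;
        }
        int used = 0;
        for (int count : usersPerPartition) {
            if (count > 0) used++;
        }
        // 5 million users collapse into only 100 partitions (~50,000 each)
        System.out.println("partitions used: " + used); // prints 100
    }
}
```

Inside each of those 100 partitions, `sortWithinPartitions` orders the rows, and the `mapPartitions` function then iterates over many users' rows; the per-user logic has to detect user boundaries itself as it walks the sorted iterator.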
