Spark: Difference between repartition and spark.sql.shuffle.partitions
I am running a Spark program with --conf spark.sql.shuffle.partitions=100.
Inside the application I have the following:
// Repartition by user id, then sort each partition by timestamp
Dataset<Row> df_partitioned = df.repartition(df.col("enriched_usr_id"));
df_partitioned = df_partitioned.sortWithinPartitions(df_partitioned.col("transaction_ts"));
Dataset<Transformed> result = df_partitioned.mapPartitions(
        SparkFunctionImpl.mapExecuteUserLogic(), Encoders.bean(Transformed.class));
I have around 5 million users, and I want to sort the data for every user and execute some logic per user.
My question is: does this partition the data into 5 million partitions or into 100, and how does the execution work per user?
df.repartition(df.col("enriched_usr_id")) will partition the data into 100 partitions (the value of spark.sql.shuffle.partitions), which means multiple users will end up in the same partition. All rows for a given user still land in the same partition, since the partition is determined by the hash of the partitioning column.
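To see why 5 million users fit into 100 partitions, here is a minimal sketch of hash bucketing in plain Java. This is not Spark's actual implementation (Spark uses a Murmur3-based hash modulo the shuffle partition count); `partitionFor`, the `"user_" + i` ids, and `String.hashCode()` are illustrative stand-ins:

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionSketch {
    // Hypothetical stand-in for Spark's column-hash partitioning:
    // every row with the same key maps to the same bucket.
    static int partitionFor(String userId, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(userId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 100; // spark.sql.shuffle.partitions
        Set<Integer> partitionsUsed = new HashSet<>();
        for (int i = 0; i < 5_000_000; i++) {
            partitionsUsed.add(partitionFor("user_" + i, numPartitions));
        }
        // 5M distinct keys collapse into at most 100 buckets,
        // so each partition holds on the order of 50,000 users.
        System.out.println("partitions used: " + partitionsUsed.size());
    }
}
```

Because many users share a partition, per-user logic inside `mapPartitions` must group rows by user id itself; Spark only guarantees that each user's rows are co-located, not that a partition contains a single user.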