
Spark dataset custom partitioner

Could you please help me find a Java API for repartitioning the sales dataset into N partitions of equal size? By equal size I mean an equal number of rows.

Dataset<Row> sales = sparkSession.read().parquet(salesPath);
sales.toJavaRDD().partitions().size(); // returns 1 -- the whole dataset sits in a single partition

AFAIK custom partitioners are not supported for Datasets. The whole idea of the Dataset and DataFrame APIs in Spark 2+ is to abstract away the need to meddle with custom partitioners. So if we face data skew and reach the point where a custom partitioner is the only option, I guess we would drop down to lower-level RDD manipulation.
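That said, for the equal-row-count requirement in the question, a custom partitioner may not be needed at all. Here is a minimal sketch using only the built-in Dataset API, assuming roughly (not exactly) equal partitions are acceptable: repartition(n) with no column arguments performs a round-robin shuffle, spreading rows evenly across n partitions (n = 8 below is a hypothetical target count).

int n = 8; // hypothetical target partition count
Dataset<Row> balanced = sales.repartition(n); // round-robin shuffle, no partition key
balanced.toJavaRDD().partitions().size();     // now returns 8

If the skew itself is the problem rather than the partition count, that is where the lower-level RDD route in the references below comes in.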

For example: the Facebook use-case study and the Spark Summit talk related to that use-case study.

Defining partitioners for RDDs is well documented in the API doc.
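As a concrete illustration of that lower-level route, here is a minimal sketch, assuming exactly equal row counts are required. EqualRowsPartitioner is a hypothetical name, and the approach (zipWithIndex plus partitionBy) is one way to do it, not the only one: each row gets a global index, and consecutive index blocks map to consecutive partitions.

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Hypothetical partitioner: row indices 0..rowsPerPartition-1 go to
// partition 0, the next block to partition 1, and so on.
// Assumes totalRows > 0.
class EqualRowsPartitioner extends Partitioner {
    private final int numPartitions;
    private final long rowsPerPartition;

    EqualRowsPartitioner(int numPartitions, long totalRows) {
        this.numPartitions = numPartitions;
        this.rowsPerPartition = (totalRows + numPartitions - 1) / numPartitions; // ceiling division
    }

    @Override
    public int numPartitions() {
        return numPartitions;
    }

    @Override
    public int getPartition(Object key) {
        return (int) ((Long) key / rowsPerPartition);
    }
}

// Usage: key each row by its global index, then partition by index range.
long totalRows = sales.count();
JavaPairRDD<Long, Row> byIndex = sales.toJavaRDD()
        .zipWithIndex()                                // JavaPairRDD<Row, Long>
        .mapToPair(t -> new Tuple2<>(t._2(), t._1())); // swap so the index is the key
JavaPairRDD<Long, Row> equalSized =
        byIndex.partitionBy(new EqualRowsPartitioner(8, totalRows));

Note that count() triggers a Spark job and zipWithIndex() may trigger another, so this exactness costs more than a plain repartition(n).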
