
Flink Custom Partition Function

I am using Scala on Flink with the DataSet API. I want to re-partition my data across the nodes. Spark has a function that lets the user re-partition the data with a given numberOfPartitions parameter (link), and I believe Flink does not support such a function. Thus, I wanted to achieve this by implementing a custom partitioning function.

My data is of type DataSet[(Double, SparseVector)]. An example line from the data:

(1.0,SparseVector((2024,1.0), (2025,1.0), (2030,1.0), (2045,1.0), (2046,1.41), (2063,1.0), (2072,1.0), (3031,1.0), (3032,1.0), (4757,1.0), (4790,1.0), (177196,1.0), (177197,0.301), (177199,1.0), (177202,1.0), (1544177,1.0), (1544178,1.0), (1544179,1.0), (1654031,1.0), (1654190,1.0), (1654191,1.0), (1654192,1.0), (1654193,1.0), (1654194,1.0), (1654212,1.0), (1654237,1.0), (1654238,1.0)))

Since my "Double" is binary (1 or -1), I want to partition my data according to the length of the SparseVector. My custom partitioner is as follows:

import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.ml.math.SparseVector // assuming FlinkML's SparseVector

class myPartitioner extends Partitioner[SparseVector] {
  // Assign a partition from the vector's size modulo the partition count.
  override def partition(key: SparseVector, numPartitions: Int): Int = {
    key.size % numPartitions
  }
}

I call this custom partitioner as follows:

data.partitionCustom(new myPartitioner(), 1) // the 1 is the tuple field used as the key, not a partition count

Can somebody please help me understand how to specify the number of partitions as the "numPartitions" argument when calling the myPartitioner function in Scala?

Thank you.

In Flink you can define setParallelism for a single operator, or for all operators using environment.setParallelism. I hope this link will help you.
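
A minimal sketch of both options, assuming the standard Scala DataSet API, where data is the DataSet[(Double, SparseVector)] from the question and the parallelism values are illustrative:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Option 1: set a default parallelism for every operator in the job.
env.setParallelism(8)

// Option 2: set the parallelism of a single operator, here the map
// that consumes the custom-partitioned data.
val repartitioned = data
  .partitionCustom(new myPartitioner(), 1)
  .map(x => x)
  .setParallelism(4)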

Spark uses the repartition(n: Int) function to redistribute data into n partitions, which will be processed by n tasks. From my perspective, this involves two changes: data redistribution and the number of downstream tasks.

Therefore, in Apache Flink, I think that the Partitioner maps to data redistribution and the parallelism maps to the number of downstream tasks, which means you can use setParallelism to determine the "numPartitions".
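
A sketch of that mapping, under the assumption that the parallelism of the operator consuming the partitioned data is the value Flink passes into partition() as numPartitions:

// numPartitions plays the role of Spark's n in repartition(n).
val numPartitions = 4

data
  .partitionCustom(new myPartitioner(), 1)         // data redistribution
  .mapPartition(records => Iterator(records.size)) // emits one count per partition
  .setParallelism(numPartitions)                   // number of downstream tasks
  .print()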

I'm assuming you're using the length of the SparseVector just to have something that gives you relatively random values to use for partitioning. If that's true, then you can just do a DataSet.rebalance(). If you follow that with any operator (including a Sink) where you set the parallelism to numPartitions, then you should get nicely repartitioned data.
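
A minimal sketch of that approach; numPartitions is just a value you choose:

val rebalanced = data
  .rebalance()  // round-robin redistribution, no custom partitioner needed
  .map(x => x)  // any operator (or a sink) would do
  .setParallelism(numPartitions)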

But your description of "...want to re-partition my data across the nodes" makes me think that you're trying to apply Spark's concept of RDDs to Flink, which isn't really valid. E.g., assuming you have numPartitions parallel operators processing the (re-partitioned) data in your DataSet, these operators will be running in slots provided by the available TaskManagers, and those slots might or might not be on different physical servers.
