Optimizing Flink transformation
I have the following method that computes the probability of each value in a DataSet:
/**
 * Compute the probabilities of each value on the given [[DataSet]]
 *
 * @param x single column [[DataSet]]
 * @return Sequence of probabilities for each value
 */
private[this] def probs(x: DataSet[Double]): Seq[Double] = {
  val counts = x.groupBy(_.doubleValue)
    .reduceGroup(_.size.toDouble)
    .name("X Probs")
    .collect
  val total = counts.sum
  counts.map(_ / total)
}
The problem is that when I submit my Flink job, which uses this method, Flink kills the job because of a task timeout. I am executing this method for each attribute of a DataSet with only 40,000 instances and 9 attributes.
Is there a way I could make this code more efficient?
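One direction that might be worth trying is to count with a built-in aggregation instead of materializing every group just to take its size. The sketch below is only an illustration under assumptions: `ProbsSketch` and `probsCombinable` are hypothetical names, it targets the Scala DataSet API, and it has not been checked against the timeout described above. Note also that each `collect` call submits its own job, so calling the method once per attribute launches nine jobs.

```scala
import org.apache.flink.api.scala._

object ProbsSketch {
  // Hypothetical variant of probs: each value is paired with a count of 1
  // and the counts are summed per value, so Flink can pre-aggregate them
  // locally instead of buffering whole groups.
  def probsCombinable(x: DataSet[Double]): Seq[Double] = {
    val counts = x
      .map(v => (v, 1L))   // pair every value with a count of one
      .groupBy(0)          // group on the value itself
      .sum(1)              // sum the counts per distinct value
      .name("X Probs")
      .collect()
      .map(_._2.toDouble)  // keep only the counts
    val total = counts.sum
    counts.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements(1.0, 1.0, 2.0, 3.0)
    // The order of the collected groups is not guaranteed, so sort for display.
    println(probsCombinable(data).sorted)
  }
}
```

As in the original method, the returned probabilities come back in no particular order, because `collect` does not guarantee the order of the groups.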
After a few tries, I made it work with mapPartition. This method is part of a class, InformationTheory, which performs computations such as entropy, mutual information, etc. For example, SymmetricalUncertainty is computed like this:
/**
 * Computes 'symmetrical uncertainty' (SU) - a symmetric mutual information measure.
 *
 * It is defined as SU(X, Y) = 2 * (IG(X|Y) / (H(X) + H(Y)))
 *
 * @param xy [[DataSet]] with two features
 * @return SU value
 */
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  val su = xy.mapPartitionWith {
    case in ⇒
      val x = in map (_._2)
      val y = in map (_._1)
      val mu = mutualInformation(x, y)
      val Hx = entropy(x)
      val Hy = entropy(y)
      Some(2 * mu / (Hx + Hy))
  }
  su.collect.head.head
}
With this, I can efficiently compute entropy, mutual information, etc. The catch is that it only works with a parallelism of 1. The problem lies in mapPartition: with more than one partition, each partition presumably computes its own value from only its slice of the data.
Is there a way I could do something similar to what I am doing here with SymmetricalUncertainty, but with any level of parallelism?
I finally did it. I don't know if it's the best solution, but it works with n levels of parallelism:
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  val su = xy.reduceGroup { in ⇒
    val invec = in.toVector
    val x = invec map (_._2)
    val y = invec map (_._1)
    val mu = mutualInformation(x, y)
    val Hx = entropy(x)
    val Hy = entropy(y)
    2 * mu / (Hx + Hy)
  }
  su.collect.head
}
You can check the entire code at InformationTheory.scala, and its tests at InformationTheorySpec.scala.
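For readers who want to check the SU formula from the doc comment without a Flink cluster, here is a minimal sketch on plain Scala collections. The bodies of `entropy` and `mutualInformation` below are assumptions (empirical frequencies, base-2 logarithms, and I(X;Y) = H(X) + H(Y) - H(X,Y)); they are not taken from InformationTheory.scala and may differ from it. The zero-denominator guard is also an addition.

```scala
object SymmetricalUncertaintySketch {
  private def log2(v: Double): Double = math.log(v) / math.log(2)

  // Shannon entropy in bits, estimated from the empirical value frequencies.
  def entropy[A](xs: Seq[A]): Double = {
    val n = xs.size.toDouble
    xs.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * log2(p)
    }.sum
  }

  // Mutual information via I(X;Y) = H(X) + H(Y) - H(X,Y),
  // where the joint entropy is taken over the zipped pairs.
  def mutualInformation(x: Seq[Double], y: Seq[Double]): Double =
    entropy(x) + entropy(y) - entropy(x.zip(y))

  // SU(X, Y) = 2 * I(X;Y) / (H(X) + H(Y)); returns 0 when both entropies are 0.
  def symmetricalUncertainty(xy: Seq[(Double, Double)]): Double = {
    val x = xy.map(_._2) // same column assignment as the code above
    val y = xy.map(_._1)
    val denom = entropy(x) + entropy(y)
    if (denom == 0.0) 0.0 else 2 * mutualInformation(x, y) / denom
  }

  def main(args: Array[String]): Unit = {
    val identical = Seq((0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0))
    val independent = Seq((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0))
    println(symmetricalUncertainty(identical))
    println(symmetricalUncertainty(independent))
  }
}
```

Under these definitions, two identical columns give an SU of 1 and two independent columns give an SU of 0, up to floating-point rounding, which makes for a quick sanity check of any distributed version.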