
Efficient implementation of SOM (Self-organizing map) on Pyspark

I am struggling to implement a performant version of the batch SOM algorithm on Spark / Pyspark for a huge dataset with > 100 features. My feeling is that I can either use RDDs, where I can/have to specify the parallelization myself, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.

Ideas:

  • Using accumulators. Parallelize the calculations by creating a UDF that takes the observations as input, calculates the impacts on the net, and sends the impacts to an accumulator on the driver. (I have already implemented this version, but it seems rather slow; I think the accumulator updates take too long.)
  • Store the results in a new column of the DataFrame and then sum them together at the end. (This would mean storing a whole neural net in each row, e.g. 20*20*130, though.) Would Spark's optimizer realize that it does not need to save each net but only to sum them together?
  • Create a custom parallelized algorithm using RDDs, similar to this one: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). However, I would have to use some kind of loop over each row to update the net, which sounds rather unperformant.
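For reference, the per-iteration computation all three options have to distribute can be written down in a few lines. Below is a minimal NumPy sketch of one batch-SOM iteration, under my own assumptions (a Gaussian neighborhood and the function name `batch_som_step` are illustrative choices, not from any particular implementation): each prototype is replaced by the neighborhood-weighted mean of all observations.

```python
import numpy as np

def batch_som_step(X, W, grid, sigma):
    """One batch-SOM iteration (illustrative sketch).
    X: (n, d) observations, W: (k, d) prototypes,
    grid: (k, 2) unit coordinates on the map, sigma: neighborhood radius."""
    # Squared distance from every observation to every prototype: (n, k)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)                 # best-matching unit per observation

    # Gaussian neighborhood kernel between map units: (k, k)
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-g2 / (2.0 * sigma ** 2))

    # h[bmu] has shape (n, k): influence of each observation on each unit
    num = h[bmu].T @ X                      # (k, d) weighted sums
    den = h[bmu].sum(axis=0)[:, None]       # (k, 1) total weights
    return num / np.maximum(den, 1e-12)     # neighborhood-weighted means
```

The key observation for any of the distributed options is that `num` and `den` are plain sums over observations, so they can be accumulated independently per partition and merged on the driver.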

Any thoughts on the different options? Is there an even better option?

Or are all these ideas not that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that? Thanks!

This is exactly what I did last year, so I might be in a good position to give you an answer.

First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).

I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:

  • they did not really implement the batch SOM algorithm, but used a map-averaging method that gave me strange results (abnormal symmetries in the output map)
  • they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit() / transform() API operating over DataFrames.

So I went on to code it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means is implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could reuse a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.

I can quickly summarize how the model is built:

  1. A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
  2. A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where the features are stored as a spark.ml.linalg.Vector in a single column. fit() then selects this column and unpacks the DataFrame to obtain the underlying RDD[Vector] of features, and calls the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
  3. SOMModel is a trained SOM model, and inherits from Spark's Transformer / Model. It contains the map prototypes (center vectors), and a transform() method that can operate on DataFrames by taking an input feature column and adding a new column with the predictions (the projection on the map). This is done by a prediction UDF.
  4. There is also a SOMTrainingSummary that collects things such as the objective function.

Here are the take-aways:

  • There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and optimization of the execution plan (the Catalyst optimizer).
  • For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
  • ...but for more complex tasks such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!
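The "distribute it yourself" pattern the last bullet refers to boils down to an associative aggregation: each partition folds its observations into a local (weighted-sum, weight) accumulator against broadcast prototypes, and the driver merges the partial accumulators. Here is a sketch of that pattern with plain functions (names are mine; in actual Spark code these would be the seqOp/combOp arguments of something like rdd.treeAggregate):

```python
import numpy as np
from functools import reduce

def make_seq_op(W, grid, sigma):
    """Build the per-partition fold: add one observation x into the
    running (weighted-sum, weight) accumulator, given broadcast
    prototypes W and map coordinates grid."""
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-g2 / (2.0 * sigma ** 2))    # (k, k) neighborhood kernel

    def seq_op(acc, x):
        num, den = acc
        bmu = ((W - x) ** 2).sum(axis=1).argmin()
        h = H[bmu]                           # influence of x on each unit
        return num + h[:, None] * x, den + h
    return seq_op

def comb_op(a, b):
    """Merge two partition accumulators (associative and commutative,
    as Spark requires)."""
    return a[0] + b[0], a[1] + b[1]
```

Because comb_op only adds element-wise, no per-row loop ever touches the driver: each worker reduces its partition locally and only the small (k, d) accumulators travel over the network, which is exactly why this beats accumulator-per-update approaches.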

I hope this answers your question. Concerning performance: since you asked for an efficient implementation, I should say I have not run any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.
