Efficient implementation of SOM (Self organizing map) on Pyspark
I am struggling to implement a performant version of the batch SOM algorithm on Spark / Pyspark for a huge dataset with > 100 features. My feeling is that I can either use RDDs, where I can/have to specify the parallelization myself, or use DataFrames, which should be more performant, but then I see no way to use something like a local accumulation variable for each worker.
Ideas:

Any thoughts on the different options? Is there an even better option?

Or are all these ideas not that good, and should I just preselect a maximum-variety subset of my dataset and train a SOM locally on that? Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.

First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in Pyspark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:

- they did not use the DataFrame API (pure RDD API), and were not in the Spark ML/MLlib spirit, i.e. with a simple `fit()` / `transform()` API operating over DataFrames.

So, there I went and coded it myself: the batch SOM algorithm in Spark ML style. The first thing I did was look at how k-means is implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
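To see the k-means connection concretely: one batch-SOM iteration is a k-means step in which each point's contribution is spread over the map neighborhood of its best-matching unit (BMU). A minimal single-machine NumPy sketch (function and variable names are mine, not taken from the Scala implementation above):

```python
import numpy as np

def batch_som_step(X, prototypes, grid, sigma):
    """One batch SOM iteration: assign BMUs, then recompute every
    prototype as a neighborhood-weighted mean of all points."""
    # Distance from every point to every prototype: shape (n_points, n_units)
    d = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
    bmu = d.argmin(axis=1)                      # best-matching unit per point
    # Gaussian neighborhood between each unit and each point's BMU,
    # measured on the 2-D map grid (not in feature space)
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-grid_d2 / (2 * sigma ** 2))     # (n_units, n_units)
    w = h[:, bmu]                               # (n_units, n_points) weights
    num = w @ X                                 # weighted sums per unit
    den = w.sum(axis=1, keepdims=True)          # total weight per unit
    return num / den                            # new prototypes
```

With `sigma` close to zero the neighborhood matrix approaches the identity, and the update collapses to the plain k-means centroid step.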
I can summarize quickly how the model is built:

- A `SOMParams` class, containing the SOM hyperparameters (size, training parameters, etc.)
- A `SOM` class, which inherits from Spark's `Estimator` and contains the training algorithm. In particular, it contains a `fit()` method that operates on an input `DataFrame`, where the features are stored as a `spark.ml.linalg.Vector` in a single column. `fit()` will then select this column and unpack the `DataFrame` to obtain the underlying `RDD[Vector]` of features, and call the `run()` method on it. This is where all the computations happen, and as you guessed, it uses `RDD`s, accumulators and broadcast variables. Finally, the `fit()` method returns a `SOMModel` object.
- `SOMModel` is a trained SOM model, which inherits from Spark's `Transformer` / `Model`. It contains the map prototypes (center vectors), and a `transform()` method that can operate on `DataFrame`s by taking an input feature column and adding a new column with the predictions (projection on the map). This is done by a prediction UDF.
- There is also a `SOMTrainingSummary` that collects things such as the objective function.

Here are the take-aways:
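To make the Estimator/Model split concrete, here is a hypothetical, stripped-down single-machine Python analogue of that `fit()` / `transform()` structure (all class and parameter names are illustrative sketches, not the actual Scala API):

```python
import numpy as np

class SOM:
    """Estimator-like object: holds hyperparameters, fit() runs training."""
    def __init__(self, grid_height=5, grid_width=5, max_iter=10, sigma=1.0):
        self.grid_height = grid_height
        self.grid_width = grid_width
        self.max_iter = max_iter
        self.sigma = sigma

    def fit(self, X):
        # Map coordinates of each unit on the 2-D grid
        grid = np.array([[i, j] for i in range(self.grid_height)
                                for j in range(self.grid_width)], dtype=float)
        rng = np.random.default_rng(0)
        prototypes = X[rng.choice(len(X), len(grid))]  # random init
        for _ in range(self.max_iter):
            # BMU assignment, then neighborhood-weighted mean update
            d = np.linalg.norm(X[:, None] - prototypes[None], axis=2)
            bmu = d.argmin(axis=1)
            g2 = ((grid[:, None] - grid[None]) ** 2).sum(axis=2)
            h = np.exp(-g2 / (2 * self.sigma ** 2))
            w = h[:, bmu]
            prototypes = (w @ X) / w.sum(axis=1, keepdims=True)
        return SOMModel(prototypes)

class SOMModel:
    """Transformer-like object: holds trained prototypes, transform() predicts."""
    def __init__(self, prototypes):
        self.prototypes = prototypes

    def transform(self, X):
        d = np.linalg.norm(X[:, None] - self.prototypes[None], axis=2)
        return d.argmin(axis=1)   # BMU index per row, like the prediction UDF
```

In the Spark version, `fit()` would additionally unpack the feature column into an `RDD[Vector]` before training, and `transform()` would add the prediction column via a UDF instead of returning an array.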
- There is no real opposition between `RDD`s and `DataFrame`s (or rather `Dataset`s, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an `RDD` specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and optimization of the execution plan (Catalyst optimizer).
- For structured data and select/filter/aggregation operations, use `DataFrame`s, always.
- ...but for more complex tasks, such as machine learning algorithms, you need to fall back to the `RDD` API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's only a nice wrapper around RDD manipulations!

Hope this answers your question. Concerning performance, as you asked for an efficient implementation: I have not made any benchmarks yet, but I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.