
Distributed Word2Vec Model Training using Apache Spark 2.0.0 and mllib

I have been experimenting with Spark and MLlib to train a word2vec model, but I don't seem to be getting the performance benefits of distributed machine learning on large datasets. My understanding is that if I have w workers and I create an RDD with n partitions, where n > w, then calling Word2Vec's fit function with that RDD as the parameter should make Spark distribute the data uniformly, train separate word2vec models on the w workers, and use some sort of reducer function at the end to combine those w models into a single output model. This should reduce computation time, since w chunks of data are processed simultaneously instead of one. The trade-off would be some loss of precision, depending on the reducer function used at the end. Does Word2Vec in Spark actually work this way? If it does, I may need to play with the configurable parameters.

EDIT

Adding the reason behind asking this question: I ran Java Spark word2vec code on 10 worker machines and, after going through the documentation, set suitable values for executor-memory, driver-memory and num-executors. The input was a 2.5 GB text file, which was mapped to RDD partitions and then used as training data for an MLlib word2vec model. The training part took multiple hours, and the number of worker nodes doesn't seem to have much effect on the training time. The same code runs successfully on smaller data files (on the order of tens of MBs).
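For context, the job was submitted roughly like this (the class name, memory sizes and paths below are illustrative placeholders rather than the exact values I used):

spark-submit --class SampleWord2Vec \
    --master yarn \
    --num-executors 10 \
    --executor-memory 8g \
    --driver-memory 8g \
    sample-word2vec.jar /path/to/input.txt /path/to/output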

Code

SparkConf conf = new SparkConf().setAppName("SampleWord2Vec");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(new Class[]{String.class, List.class});
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<List<String>> jrdd = jsc.textFile(inputFile, 3).map(new Function<String, List<String>>(){            
        @Override
        public List<String> call(String s) throws Exception {
            return Arrays.asList(s.split(","));
        }        
});
jrdd.persist(StorageLevel.MEMORY_AND_DISK());
Word2Vec word2Vec = new Word2Vec()
      .setWindowSize(20)
      .setMinCount(20);

Word2VecModel model = word2Vec.fit(jrdd);
jrdd.unpersist(false);
model.save(jsc.sc(), outputfile);
jsc.stop();
jsc.close();

Judging from the comments, answers and downvotes, I guess I wasn't able to frame my question correctly. But the answer to what I wanted to know is yes: it is possible to train your word2vec model in parallel on Spark. The pull request for this feature was created a long time back:

https://github.com/apache/spark/pull/1719

In Java, there is a setter method (setNumPartitions) on the Word2Vec object in Spark MLlib. This allows you to train your word2vec model on more than one executor in parallel. As per the comments on the pull request mentioned above:

" To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed. " 为了使我们的实现更具可扩展性,我们分别训练每个分区,并在每次迭代后合并每个分区的模型。为了使模型更准确,可能需要多次迭代。

Hope this helps someone.

I don't see anything inherently wrong with your code. I would highly recommend you consider the DataFrames API, however. As an example, here's a little chart that is frequently thrown around:

(image: performance chart for the DataFrame API)

Also, I don't know how you may have been "iterating" over elements of the data frame (that's not really how they work). Here's an example from the Spark online docs:

(image: code example from the Spark online docs)
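In the same spirit, a short Java sketch that works on columns rather than iterating over individual elements (the DataFrame df and its column names here are just placeholders):

import static org.apache.spark.sql.functions.col;

df.printSchema();                        // inspect the schema Spark inferred
df.select(col("word")).show();           // project a column
df.filter(col("count").gt(100)).show();  // filter rows with a column expression
df.groupBy(col("word")).count().show();  // aggregate without any explicit loop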

You have the general idea... but you have to parallelize your data as a data frame first. It is quite trivial to translate your JavaRDD to a DataFrame instead.

DataFrame fileDF = sqlContext.createDataFrame(jrdd, Model.class);
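From there, a minimal sketch of the same training with the DataFrame-based spark.ml Word2Vec could look like the following. This is an assumption about how you would adapt your pipeline, not your exact code: in Spark 2.0 the Java DataFrame type is Dataset<Row>, and the "text" column name and SparkSession setup below are placeholders; jrdd and outputfile are reused from your snippet.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("SampleWord2Vec").getOrCreate();

// Reuse the tokenised JavaRDD<List<String>> from above: wrap each token list in a Row.
JavaRDD<Row> rows = jrdd.map(tokens -> RowFactory.create(tokens));

// One column, "text", holding the array of tokens for each document.
StructType schema = new StructType(new StructField[]{
        new StructField("text", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty())
});
Dataset<Row> docs = spark.createDataFrame(rows, schema);

Word2Vec word2Vec = new Word2Vec()
        .setInputCol("text")
        .setOutputCol("result")
        .setWindowSize(20)
        .setMinCount(20);

Word2VecModel model = word2Vec.fit(docs);
model.write().overwrite().save(outputfile);   // same output path variable as in your code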

Spark runs a Directed Acyclic Graph (DAG) in lieu of MR, but the concept is the same. Running fit() on your data will indeed run across the data on the workers and then reduce to a single model. But this model will itself be distributed in memory until you decide to write it out.

But, as a trial, how long would it take you to run the same file through, say, NLTK or word2vec's native C++ binary?

One last thought... is there a reason you are persisting to memory AND disk? Spark has a native .cache() that persists to memory by default. The power of Spark is to do machine learning on data held in memory... BIG data in memory. If you persist to disk, even with Kryo you are creating a bottleneck at disk I/O. IMHO the first thing to try would be to get rid of this and persist just to memory. If performance improves, great, but you will find leaps and bounds of performance by leaning on the power of Catalyst through DataFrames.
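Concretely, as a first experiment, just swap the storage level in your snippet and leave everything else unchanged:

// Memory-only caching; MEMORY_AND_DISK is what introduces the disk I/O bottleneck.
jrdd.cache();   // equivalent to jrdd.persist(StorageLevel.MEMORY_ONLY())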

One thing we didn't discuss is your cluster. It would be helpful to think about things like how much memory per node you have... how many cores per node... whether your cluster is virtualized alongside other apps that are asking for resources (over-provisioned like most vHosts)... Is your cluster in the cloud? Shared or dedicated?

Have you looked at Spark's UI to analyze the runtime operations of the code? What do you see when you run top on the workers while the model is fitting? Can you see full CPU utilization? Have you tried specifying --executor-cores to make sure you are making full use of the CPU?

I've seen it happen many times that all the work is being done on one core on one worker node. It would be helpful to have this info.

When troubleshooting performance, there are many places to look, including the Spark config files themselves!
