Dumping Spark Word2Vec vectors to a file

I am using Spark MLlib to generate word vectors. I want to fit the model on all my data, then get the trained word vectors and dump them to a file.

This is what I am doing:

JavaRDD<List<String>> data = javaSparkContext.parallelize(streamingData, partitions);
Word2Vec word2vec = new Word2Vec();
Word2VecModel model = word2vec.fit(data);

So, if my training data had sentences like

I love Spark

I want to save the output to files as:

I       0.03 0.53 0.12...
love    0.31 0.14 0.12...
Spark   0.41 0.18 0.84...

After training, I get the vectors from the model object like this:

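// model.getVectors() returns a Scala Map; convert it to a java.util.Map
Map<String, float[]> wordMap = JavaConverters.mapAsJavaMapConverter(model.getVectors()).asJava();
List<String> wordvectorlist = Lists.newArrayList();
for (Map.Entry<String, float[]> entry : wordMap.entrySet()) {
    StringBuilder wordvector = new StringBuilder(entry.getKey());
    for (float f : entry.getValue()) {
        wordvector.append(' ').append(f);
    }
    wordvectorlist.add(wordvector.toString());
    // Flush in batches so the list of formatted lines stays bounded
    if (wordvectorlist.size() > 1000000) {
        writeToFile(wordvectorlist);
        wordvectorlist.clear();
    }
}
// Write whatever is left after the last full batch
if (!wordvectorlist.isEmpty()) {
    writeToFile(wordvectorlist);
}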

I will be generating these word vectors from a very large dataset (~1.5 TB), so I might not be able to hold the returned Word2VecModel object in my driver's memory. How can I store this word-vector map as an RDD so that I can write it to files without holding the full map in driver memory?

I looked into the word2vec implementation in deeplearning4j, but that implementation also requires loading all the vectors into driver memory.

Word2VecModel has a save function that writes the model to disk in its own format. This creates a directory called data containing Parquet files with the model data, plus a metadata file with human-readable metadata.

You can now read the Parquet files and convert them yourself, or use spark.read.parquet to load them into a DataFrame. Each row contains part of the map, and you can write it out any way you wish.
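
As a rough illustration of the "write it any way you wish" step, here is a minimal plain-Java sketch of a per-row formatter that produces the `word v1 v2 ...` layout asked for in the question. The class name `WordVectorLines` and the methods `formatLine` and `formatAll` are made up for this example and are not part of Spark. It assumes each row read back from the saved data gives you a word and its float array. In a real job the same `formatLine` logic would presumably be applied to each row on the executors before writing text output, rather than to a map held in the driver.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a Spark class.
class WordVectorLines {

    // Formats one (word, vector) pair as "word v1 v2 ... vn",
    // with single spaces between the word and each component.
    static String formatLine(String word, float[] vector) {
        StringBuilder line = new StringBuilder(word);
        for (float component : vector) {
            line.append(' ').append(component);
        }
        return line.toString();
    }

    // Formats a batch of rows, preserving the iteration order of the input map.
    static List<String> formatAll(Map<String, float[]> rows) {
        List<String> lines = new ArrayList<>(rows.size());
        for (Map.Entry<String, float[]> row : rows.entrySet()) {
            lines.add(formatLine(row.getKey(), row.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Small made-up sample; real vectors would come from the saved model data.
        Map<String, float[]> sample = new LinkedHashMap<>();
        sample.put("I", new float[] {0.5f, 0.25f});
        sample.put("love", new float[] {-1.0f, 2.0f});
        for (String line : formatAll(sample)) {
            System.out.println(line);
        }
    }
}
```

Keeping the formatter free of any Spark types means it can be unit tested on its own and then reused inside whatever per-row transformation you apply to the DataFrame.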
