
Dumping spark word2vec vectors to a file

I am using Spark MLlib to generate word vectors. I want to fit all my data, then get the trained word vectors and dump them to a file.

This is what I am doing:

JavaRDD<List<String>> data = javaSparkContext.parallelize(streamingData, partitions);
Word2Vec word2vec = new Word2Vec();
Word2VecModel model = word2vec.fit(data);

So, if my training data had sentences like

I love Spark

I want to save the output to files as:

I       0.03 0.53 0.12...
love    0.31 0.14 0.12...
Spark   0.41 0.18 0.84...

After training, I get the vectors from the model object like this:

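// Convert the Scala map returned by getVectors() into a Java map.
Map<String, float[]> wordMap = JavaConverters.mapAsJavaMapConverter(model.getVectors()).asJava();
List<String> wordvectorlist = Lists.newArrayList();
for (Map.Entry<String, float[]> entry : wordMap.entrySet()) {
    StringBuilder wordvector = new StringBuilder(entry.getKey());
    for (float f : entry.getValue()) {
        wordvector.append(' ').append(f);
    }
    wordvectorlist.add(wordvector.toString());
    // Flush to disk in batches so the list of formatted lines stays bounded.
    if (wordvectorlist.size() > 1000000) {
        writeToFile(wordvectorlist);
        wordvectorlist.clear();
    }
}
// Write whatever is left after the last full batch.
if (!wordvectorlist.isEmpty()) {
    writeToFile(wordvectorlist);
}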

I will be generating these word vectors for a very large dataset (~1.5 TB), so I might not be able to hold the returned Word2VecModel object in my driver's memory. How can I store this word vector map as an RDD so that I can write it to files without keeping the full map in driver memory?

I looked into the Word2Vec implementation in deeplearning4j, but that implementation also requires loading all the vectors into driver memory.

Word2VecModel has a save function that writes the model to disk in its own format. Under the path you pass to save, this creates a directory called data containing Parquet files with the vectors, plus a metadata file with human-readable metadata.

You can then read the Parquet files and convert them yourself, or use spark.read.parquet to load them into a DataFrame. Each row holds part of the map (a word and its vector), and you can write it out in any format you wish.
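
The read-and-convert step could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the model was already saved with model.save(sc, modelPath), that the saved Parquet data has a string column named word and a float-array column named vector (the printSchema() call lets you check this for your Spark version), and that the class name, the toLine helper, and both paths are made up for illustration.

```java
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DumpWord2VecVectors {

    // Hypothetical helper: formats one entry as "word v1 v2 ...",
    // the layout asked for in the question.
    public static String toLine(String word, List<Float> vector) {
        StringBuilder line = new StringBuilder(word);
        for (Float component : vector) {
            line.append(' ').append(component);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Hypothetical paths supplied on the command line.
        String modelPath = args[0];   // the path previously passed to model.save(sc, modelPath)
        String outputPath = args[1];  // directory that will receive the text part files

        SparkSession spark = SparkSession.builder()
                .appName("DumpWord2VecVectors")
                .getOrCreate();

        // Load the saved vectors as a DataFrame instead of a driver-side map.
        Dataset<Row> vectors = spark.read().parquet(modelPath + "/data");
        // Assumption: columns are "word" (string) and "vector" (array of float).
        vectors.printSchema();

        // Convert each row to one text line; this runs on the executors.
        Dataset<String> lines = vectors.map(
                (MapFunction<Row, String>) row -> {
                    String word = row.getAs("word");
                    List<Float> vector = row.getList(row.fieldIndex("vector"));
                    return toLine(word, vector);
                },
                Encoders.STRING());

        // Each partition is written as its own part file under outputPath.
        lines.write().text(outputPath);

        spark.stop();
    }
}
```

Because the conversion is expressed as a map over the DataFrame, the formatted lines never have to be collected on the driver; the output is a directory of part files rather than a single file.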
