
How to implement Word2Vec in Java?

I installed word2vec on my Ubuntu laptop by following this tutorial. Is it really necessary to install DL4J in order to use word2vec vectors in Java? I'm comfortable working in Eclipse, and I'm not sure that I want all the other prerequisites that DL4J wants me to install.

Ideally there would be a really easy way for me to keep the Java code I've already written (in Eclipse) and change just a few lines, so that the word look-ups I am doing retrieve a word2vec vector instead of going through my current retrieval process.


Also, I've looked into using GloVe, but I do not have MATLAB. Is it possible to use GloVe without MATLAB? (I got an error while installing it because of this.) If so, the same question as above applies: I have no idea how to implement it in Java.

What is preventing you from saving the output of word2vec (the C program) in text format, and then reading that file with a piece of Java code that loads the vectors into a hashmap keyed by the word string?

Some code snippets:

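import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

// Class to store a hashmap of wordvecs
public class WordVecs {

    HashMap<String, WordVec> wordvecmap;
    ....
    // Loads every line of the file named by the "wordvecs.vecfile" property.
    // `prop` is presumably a java.util.Properties field set up in the elided code.
    void loadFromTextFile() {
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
             BufferedReader br = new BufferedReader(fr)) {
            String line;

            // Note: if the file starts with a "<vocabSize> <dimensions>" header line,
            // it is loaded like any other line and ends up as a one-dimensional entry.
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    ....
}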

// Class for a single wordvec; each input line is expected to look like "word v1 v2 ... vN"
public class WordVec implements Comparable<WordVec> {
    public WordVec(String line) {
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        vec = new float[tokens.length-1];
        for (int i = 1; i < tokens.length; i++)
            vec[i-1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    ....
}
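
For readers who want something self-contained to paste into Eclipse, here is a minimal sketch of the same idea with no external dependencies. The class name `TextVectors` and its methods are made up for illustration. The sketch assumes the file was written by the C tool in text mode (`-binary 0`), where the first line is a `<vocabSize> <dimensions>` header followed by one `word v1 v2 ... vN` line per word.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of word2vec or DL4J.
final class TextVectors {
    private TextVectors() {}

    // Parses lines of the form "word v1 v2 ... vN" into a word -> vector map.
    static Map<String, float[]> parse(Iterable<String> lines) {
        Map<String, float[]> vectors = new HashMap<>();
        boolean first = true;
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            // Assumption: a first line with exactly two tokens is the
            // "<vocabSize> <dimensions>" header and carries no vector.
            boolean header = first && tokens.length == 2;
            first = false;
            if (header || tokens.length < 2) {
                continue; // skip the header and blank or malformed lines
            }
            float[] vec = new float[tokens.length - 1];
            for (int i = 1; i < tokens.length; i++) {
                vec[i - 1] = Float.parseFloat(tokens[i]);
            }
            vectors.put(tokens[0], vec);
        }
        return vectors;
    }

    // Reads a UTF-8 text file and parses it; I/O errors are rethrown unchecked.
    static Map<String, float[]> load(Path file) {
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return parse(br.lines()::iterator);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this in place, an existing look-up could become something like `float[] v = vectors.get(word);` after `Map<String, float[]> vectors = TextVectors.load(Paths.get("vectors.txt"));`, where `vectors.txt` is a placeholder path. A word missing from the vocabulary returns `null`, so the calling code needs to handle that case.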

If you want to get the nearest neighbours of a given word, you can pre-compute the N nearest neighbours of each word once and store that list with the corresponding WordVec object.
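
One way to compute such a neighbour list is a brute-force scan ranked by cosine similarity. The sketch below is only an illustration: the class `Neighbours` and its method names are hypothetical, it works on a plain `Map<String, float[]>` rather than the `WordVec` class above, and it assumes all vectors have the same dimension.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; brute force, O(vocabulary) per query.
final class Neighbours {
    private Neighbours() {}

    // Cosine similarity of two equal-length vectors; 0 if either has zero norm.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to n words most similar to `word`, best match first.
    // Returns an empty list if `word` is not in the map.
    static List<String> nearest(Map<String, float[]> vectors, String word, int n) {
        List<String> candidates = new ArrayList<>();
        float[] query = vectors.get(word);
        if (query == null) {
            return candidates;
        }
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (e.getKey().equals(word)) {
                continue; // do not report the query word itself
            }
            score.put(e.getKey(), cosine(query, e.getValue()));
            candidates.add(e.getKey());
        }
        // Sort by descending similarity.
        candidates.sort((x, y) -> Double.compare(score.get(y), score.get(x)));
        int limit = Math.max(0, Math.min(n, candidates.size()));
        return new ArrayList<>(candidates.subList(0, limit));
    }
}
```

Running `nearest` once per word and caching the result gives the pre-computed lists described above. For a large vocabulary that is a quadratic amount of work, so it may be worth doing lazily or only for the words you actually query.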

DL4J author here. Our word2vec implementation is targeted at people who need custom pipelines. I don't blame you for going the simple route here.

Our word2vec implementation is meant for when you want to do something with the vectors, not for messing around. The C word2vec format is pretty straightforward.

Here is the parsing logic in Java if you'd like: https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/loader/WordVectorSerializer.java#L113
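
To give a rough idea of how simple the format is, here is a standalone sketch of a reader for the binary variant (`-binary 1`). It is not the DL4J code linked above; the class name `BinaryVectors` is hypothetical. It assumes the layout written by the C tool: an ASCII header `<vocabSize> <dimensions>` terminated by a newline, then for each word its bytes, a single space, `<dimensions>` raw 4-byte floats, and a newline. It also assumes the floats are little-endian, which holds for files written on x86 machines.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the DL4J implementation.
final class BinaryVectors {
    private BinaryVectors() {}

    // Reads the whole model from the stream into a word -> vector map.
    static Map<String, float[]> read(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            int vocabSize = Integer.parseInt(readToken(data, ' '));
            int dim = Integer.parseInt(readToken(data, '\n'));
            Map<String, float[]> vectors = new HashMap<>();
            byte[] raw = new byte[dim * 4];
            for (int w = 0; w < vocabSize; w++) {
                String word = readToken(data, ' ');
                data.readFully(raw); // exactly dim floats, 4 bytes each
                float[] vec = new float[dim];
                // Assumption: floats were written in little-endian byte order.
                ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vec);
                vectors.put(word, vec);
            }
            return vectors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads bytes up to (not including) the delimiter. The trim() drops the
    // newline that follows the previous word's vector.
    private static String readToken(DataInputStream data, char delimiter) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (byte b = data.readByte(); b != delimiter; b = data.readByte()) {
            bytes.write(b);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8).trim();
    }
}
```

For a real model file, wrap the `FileInputStream` in a `BufferedInputStream` before passing it in, since the sketch reads words one byte at a time.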

Hope that helps a bit.


 