How to implement Word2Vec in Java?

I installed word2Vec using this tutorial on my Ubuntu laptop. Is it completely necessary to install DL4J in order to use word2Vec vectors in Java? I'm comfortable working in Eclipse and I'm not sure that I want all the other prerequisites that DL4J wants me to install.

Ideally there would be a really easy way for me to keep the Java code I've already written (in Eclipse) and change a few lines, so that the word look-ups I am doing retrieve a word2Vec vector instead of going through my current retrieval process.


Also, I've looked into using GloVe; however, I do not have MATLAB. Is it possible to use GloVe without MATLAB? (I got an error while installing it because of this.) If so, the same question as above applies: I have no idea how to implement it in Java.

What is preventing you from saving the word2vec (the C program) output in text format, then reading the file with a piece of Java code and loading the vectors into a hashmap keyed by the word string?

Some code snippets:

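import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

// Class to store a hashmap of wordvecs
public class WordVecs {

    // Maps each word string to its vector
    HashMap<String, WordVec> wordvecmap;
    ....
    void loadFromTextFile() {
        // prop is declared in the elided part of the class (presumably a java.util.Properties)
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
            BufferedReader br = new BufferedReader(fr)) {
            String line;

            // One word vector per line of the text file
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    ....
}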

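// class for each wordvec
// (the fields word, vec and norm, plus getNorm() and compareTo(), are in the elided part)
public class WordVec implements Comparable<WordVec> {
    public WordVec(String line) {
        // A line looks like: <word> <v1> <v2> ... <vN>
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        // Every token after the first is one component of the vector
        vec = new float[tokens.length-1];
        for (int i = 1; i < tokens.length; i++)
            vec[i-1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    ....
}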
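
For a version that compiles on its own, here is a minimal sketch of the same idea. The class name TextVectorLoader and the plain Map<String, float[]> are my own choices for illustration, not anything from word2vec or DL4J. It assumes the file starts with the "<vocab size> <dimension>" header line that the C tool writes and skips it (the constructor above would otherwise turn that line into a bogus entry), and it assumes vectors have more than one dimension so the header cannot be confused with a real entry.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the name is an assumption, not a library API.
final class TextVectorLoader {

    // Reads "word v1 v2 ... vN" lines into a map keyed by the word string.
    static Map<String, float[]> load(Path file) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean firstLine = true;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                boolean isFirst = firstLine;
                firstLine = false;
                // Assumed header: "<vocabSize> <dimension>" on the first line only.
                if (isFirst && tokens.length == 2) continue;
                // Skip blank or malformed lines that carry no vector components.
                if (tokens.length < 2) continue;
                float[] vec = new float[tokens.length - 1];
                for (int i = 1; i < tokens.length; i++) {
                    vec[i - 1] = Float.parseFloat(tokens[i]);
                }
                vectors.put(tokens[0], vec);
            }
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the text-format vector file.
        Map<String, float[]> vectors = load(Paths.get(args[0]));
        System.out.println("Loaded " + vectors.size() + " vectors");
    }
}
```

With this in place, an existing look-up only needs to call vectors.get(word), which returns null for words that are not in the vocabulary.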

If you want to get the nearest neighbours for a given word, you can keep a list of the N nearest pre-computed neighbours associated with each WordVec object.
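
One way to pre-compute those neighbours is a brute-force pass using cosine similarity. This is only a sketch under my own assumptions: the answer does not say which similarity measure to use, the class and method names are hypothetical, and vectors are held in a plain Map<String, float[]> rather than in WordVec objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; cosine similarity is an assumed choice of measure.
final class NearestNeighbours {

    // Cosine similarity of two vectors of equal length; 0 if either is all zeros.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to n words most similar to the query, best match first.
    static List<String> nearest(Map<String, float[]> vectors, String query, int n) {
        float[] q = vectors.get(query);
        if (q == null) return new ArrayList<>();
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (!e.getKey().equals(query)) {
                score.put(e.getKey(), cosine(q, e.getValue()));
            }
        }
        List<String> words = new ArrayList<>(score.keySet());
        // Sort by descending similarity.
        words.sort(Comparator.comparingDouble((String w) -> score.get(w)).reversed());
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }

    // Pre-computes the n nearest neighbours of every word.
    static Map<String, List<String>> precompute(Map<String, float[]> vectors, int n) {
        Map<String, List<String>> neighbours = new HashMap<>();
        for (String word : vectors.keySet()) {
            neighbours.put(word, nearest(vectors, word, n));
        }
        return neighbours;
    }
}
```

The pre-computation compares every pair of words, so it is quadratic in the vocabulary size. That is fine for a small vocabulary, but for a large one you would probably want to compute neighbours on demand or restrict the candidate set.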

DL4J author here. Our word2vec implementation is targeted at people who need custom pipelines. I don't blame you for going the simple route here.

Our word2vec implementation is meant for when you want to do something with the vectors, not for messing around. The C word2vec format is pretty straightforward.

Here is parsing logic in Java if you'd like: https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/loader/WordVectorSerializer.java#L113
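
If you would rather not pull in DL4J just for the parser, the binary variant of the format can be read in a few lines. The sketch below is not the DL4J code; it is my own reading of what the C tool writes in binary mode: a text header "<vocab size> <dimension>" ending in a newline, then for each word the word itself, a space, the raw 4-byte floats, and a newline. It assumes the file was written on a little-endian machine (the C tool dumps floats in native byte order), and the class name BinaryVectorLoader is hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; the format details are assumptions stated above.
final class BinaryVectorLoader {

    // Reads bytes up to (not including) the delimiter, skipping leading newlines.
    private static String readToken(InputStream in, char delimiter) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == delimiter) break;
            // A newline before any content is left over from the previous entry.
            if (b == '\n' && buf.size() == 0) continue;
            buf.write(b);
        }
        if (b == -1 && buf.size() == 0) throw new EOFException("unexpected end of file");
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    static Map<String, float[]> load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            // Header line: "<vocabSize> <dimension>\n"
            int vocabSize = Integer.parseInt(readToken(in, ' ').trim());
            int dim = Integer.parseInt(readToken(in, '\n').trim());
            Map<String, float[]> vectors = new HashMap<>();
            byte[] raw = new byte[dim * 4];
            for (int w = 0; w < vocabSize; w++) {
                String word = readToken(in, ' ');
                in.readFully(raw);
                // Assumed byte order: little-endian, as on x86 machines.
                ByteBuffer bb = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
                float[] vec = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vec[i] = bb.getFloat();
                }
                vectors.put(word, vec);
            }
            return vectors;
        }
    }
}
```

If the loaded values look like garbage, the byte order is the first thing to check; the text format avoids that question entirely.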

Hope that helps a bit.
