
How to update the vocabulary after training a Word2vec model using deeplearning4j

I used the deeplearning4j library to perform a vectorization task with word2vec. After training the model on a specific corpus, I need to vectorize a new word. How can I add this new word and update the training to get a weight vector for it? My code is as follows:

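import java.io.File;
import java.io.IOException;

import org.deeplearning4j.models.embeddings.learning.impl.elements.SkipGram;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.FileSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Test {

    // The question does not show these fields; set them to the corpus and model paths before running.
    String inputFilePath;
    String modelFilePath;

    // The method header is cut off in the question; a standard main is assumed here.
    public static void main(String[] args) throws IOException {
        Test word2Vec = new Test();
        word2Vec.train();

        // Load the trained model back from disk to test it
        Word2Vec word2VecModel = WordVectorSerializer.readWord2VecModel(new File(word2Vec.modelFilePath));

        // Cosine similarity between two tokens from the training corpus
        double cosi = word2VecModel.similarity("httpdbpediaorgresourcethe_terminator", "httpdbpediaorgresourceterminator_salvation");
        System.out.println(cosi);
    }

    public void train() throws IOException {
        // Read the training corpus from the input file
        SentenceIterator sentenceIterator = new FileSentenceIterator(new File(inputFilePath));

        // Split sentences into tokens; CommonPreprocessor lowercases them and strips punctuation and digits
        TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
        tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

        Word2Vec vec = new Word2Vec.Builder()
                .layerSize(100)
                .windowSize(5)
                .epochs(5) // 3-50, see https://arxiv.org/pdf/1301.3781.pdf
                .elementsLearningAlgorithm(new SkipGram<VocabWord>())
                .iterate(sentenceIterator)
                .tokenizerFactory(tokenizerFactory)
                .build();
        vec.fit();

        // Save the trained word vectors to disk
        WordVectorSerializer.writeWordVectors(vec, modelFilePath);
    }
}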

Deeplearning4j can update the weights of an existing model, but it won't add new words to the vocabulary. Generally, doing that isn't recommended anyway: adding only a few single sentences doesn't provide enough data to train good word vectors.

Training doesn't normally take too long anyway. I would recommend simply retraining the model on a corpus that includes the new words.
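
In practice, retraining with the new words means rebuilding the corpus file and running train() again. The sketch below covers only the corpus-rebuilding step and uses nothing but the JDK. The class name CorpusMerger, the method mergeCorpora, and the one-sentence-per-line layout are assumptions made for illustration; they are not part of Deeplearning4j or of the question's code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: combines the original corpus with new sentences
// so that the model can be retrained from scratch on the merged file.
final class CorpusMerger {

    /**
     * Writes the non-blank lines of the original corpus, followed by the
     * non-blank lines of the new-sentences file, into mergedCorpus.
     * Assumes both inputs are UTF-8 text with one sentence per line.
     * Returns the number of lines written.
     */
    static int mergeCorpora(Path originalCorpus, Path newSentences, Path mergedCorpus)
            throws IOException {
        List<String> merged = new ArrayList<>();
        for (Path source : new Path[] {originalCorpus, newSentences}) {
            for (String line : Files.readAllLines(source, StandardCharsets.UTF_8)) {
                if (!line.trim().isEmpty()) { // skip blank lines
                    merged.add(line);
                }
            }
        }
        Files.write(mergedCorpus, merged, StandardCharsets.UTF_8);
        return merged.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: CorpusMerger <original> <new-sentences> <merged>");
            return;
        }
        int count = mergeCorpora(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println("Wrote " + count + " lines to " + args[2]);
    }
}
```

Pointing inputFilePath at the merged file and calling train() again should then rebuild the vocabulary with the new words included, provided each new word appears often enough in the added text to pass the model's minimum word frequency setting.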
