
How to clear the vocab cache in DeepLearning4j Word2Vec so it is retrained every time

Thanks in advance. I am using Word2Vec in DeepLearning4j.

How do I clear the vocab cache in Word2Vec? I want it to retrain on a new set of word patterns every time I reload Word2Vec. For now, it seems that the vocabulary from the previous set of word patterns persists, and I get the same result even though I changed my input training file.

I tried to reset the model, but it doesn't work. Code:

Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency(1)
            .iterations(1)
            .layerSize(4)
            .seed(1)
            .windowSize(1)
            .iterate(iter)
            .tokenizerFactory(t)
            .resetModel(true)
            .limitVocabularySize(1)
            .build();

Can anyone help?

If you want to retrain (this is called training), I understand that you want to completely ignore the previously learned model (vocabulary, word vectors, ...). To do that, you should create another Word2Vec object and fit it with the new data. You should also use new instances of the SentenceIterator and Tokenizer classes. Your problem could be the way you change your input training files.

It should be OK if you just change the SentenceIterator, i.e.:

SentenceIterator iter = new CollectionSentenceIterator(DataFetcher.getFirstDataset());
Word2Vec vec = new Word2Vec.Builder()
            .iterate(iter)
            ....
            .build();

vec.fit();

vec.wordsNearest("clear", 10); // you will see results from first dataset

SentenceIterator iter2 = new CollectionSentenceIterator(DataFetcher.getSecondDataset());
vec =  new Word2Vec.Builder()
    .iterate(iter2)
    ....
    .build();

vec.fit();

vec.wordsNearest("clear", 10); // you will see results from second dataset, without any first dataset implication

If you run the code twice and change your input data between executions (say A, then B), you shouldn't get the same results. If you do, that means your model learned the same thing from input data A and B.
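One way to rule out the "input didn't really change" case before blaming the model is to compare the raw word counts of the two datasets. This is only a rough sketch in plain Java: `VocabDiff`, `countWords` and `sameVocabulary` are made-up names, and the lower-cased whitespace split is an assumption standing in for whatever TokenizerFactory you actually pass to Word2Vec, so it will not match DL4J's vocabulary exactly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of DL4J: compares the word counts of two datasets.
class VocabDiff {

    // Counts lower-cased, whitespace-separated tokens across all sentences.
    // Assumption: this simple split only approximates the real tokenizer.
    static Map<String, Integer> countWords(List<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            for (String token : sentence.toLowerCase().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // True when both datasets contain the same words with the same frequencies.
    static boolean sameVocabulary(List<String> a, List<String> b) {
        return countWords(a).equals(countWords(b));
    }

    public static void main(String[] args) {
        // Illustrative sentences only; replace with the lines of your training files.
        List<String> datasetA = Arrays.asList("clear the cache", "train the model");
        List<String> datasetB = Arrays.asList("reload the file", "train the model again");
        System.out.println("same vocabulary: " + sameVocabulary(datasetA, datasetB));
        // prints "same vocabulary: false"
    }
}
```

If this reports the same vocabulary for A and B, the file swap probably never reached the SentenceIterator (wrong path, cached copy, etc.). If it reports a difference, the input side is fine and the thing to check is whether a fresh Word2Vec object is really being built and fitted.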

If you want to update training (this is called inference), meaning you use the previously learned model and new data to update that model, then you should use this example from the dl4j examples.


 