
Using pretrained Word2Vec model for sentiment analysis

I am using a pretrained Word2Vec model for tweets (https://www.fredericgodin.com/software/) to create a vector for each word. I will then compute the average of these vectors per tweet and use a classifier to determine sentiment.
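For reference, a minimal sketch of that averaging step, assuming the model is loaded with gensim's KeyedVectors (the file name here is a placeholder, not necessarily the actual download name):

    import numpy as np
    from gensim.models import KeyedVectors

    # Load the pretrained Twitter vectors (400-dimensional).
    model = KeyedVectors.load_word2vec_format(
        "word2vec_twitter_model.bin", binary=True, unicode_errors="ignore")

    def tweet_vector(tokens, kv, dim=400):
        # Average the vectors of the tokens the model knows;
        # fall back to zeros if none are in the vocabulary.
        vecs = [kv[w] for w in tokens if w in kv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

    features = tweet_vector("this movie is great".split(), model)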

My training data is very large, and the pretrained Word2Vec model has been trained on millions of tweets with dimensionality 400. My problem is that it is taking too long to assign vectors to the words in my training data. Is there a way to reduce the time taken to build the word vectors?

Cheers.

It's unclear what you mean by "too long".

Looking up individual word-vectors from a pre-existing model should be very fast: it's a simple in-memory lookup of the word's array index (from a dict), then an access of that array index.
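In gensim 4.x terms, that lookup amounts to the following (a sketch, not the questioner's exact code):

    idx = model.key_to_index["hello"]  # dict lookup: word -> row index
    vec = model.vectors[idx]           # row access in the vectors array
    # Equivalent to the one-liner: vec = model["hello"]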

If it's slow for you, perhaps you've loaded a model larger than your available RAM? In that case, the operation might be relying on much slower virtual memory (paging working memory to and from disk). With these kinds of models, where access is essentially random across memory locations, you never want that to happen. If it is happening, you should get more RAM or use a smaller model.
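One way to get a smaller model without retraining (my own suggestion, not from the answer above): gensim's load_word2vec_format accepts a limit argument that reads only the first N vectors, and in a standard word2vec file the most frequent words come first. A sketch:

    from gensim.models import KeyedVectors

    # Keep only the 500k most frequent words so the model fits in RAM.
    small = KeyedVectors.load_word2vec_format(
        "word2vec_twitter_model.bin", binary=True, limit=500_000)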
