
How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:

# PREPROCESSING THE DATA

# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)

train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()

# CONVERT TRAIN DATA INTO A NESTED LIST AS WORD2VEC EXPECTS A LIST OF LISTS OF TOKENS
import nltk
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]

# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index

# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}

# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)

The new dataframe is the vectorized form of the train data. Now how do I do the same process for the test data? I can't pass the whole corpus (train+test) to the Word2Vec instance as it might lead to data leakage. Should I simply pass the test list to another instance of the model as:

model = Word2Vec(test_x3, min_count = 1)

I don't think this would be the correct way. Any help is appreciated!

PS: I am not using the pretrained word2vec in an LSTM model. What I am doing is training the Word2Vec on the data that I have and then feeding it to an ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.

Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes some of those with known labels that you're withholding from other steps as test/validation records.

You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels), are then used for enhanced feature-modeling of the texts, as input to later supervised label-aware steps.
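For example, a minimal sketch of that arrangement, reusing the train_x3/test_x3 token lists from the question (the parameter values here are assumptions to illustrate the idea, not recommendations):

from gensim.models import Word2Vec

# Word2Vec only ever sees raw tokenized text; train_y/test_y are never used here,
# so the labels stay withheld from this unsupervised step.
all_texts = train_x3 + test_x3
w2v_all = Word2Vec(all_texts, vector_size=100, min_count=5, workers=4)

The supervised classifier is then still fit only on features derived from the train split, with the test split reserved for evaluation.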

(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc. evaluation measures to reasonably estimate. Is it new situations where everything must always be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows and text is always plentiful even if labels are expensive to acquire, or where any actually-deployed classifiers will be able to leverage other unlabeled texts before committing to a prediction?)

But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space. (Or, made compatible via other less-common post-training alignment steps.) There's no single right place for any word's vector, just a good relative position with regard to everything trained in the same session. Both initialization and training use randomization, so even repeated runs on the same training data can yield end models of approximately-equivalent usefulness with wildly-different word-coordinates.

So, when withholding your test-set texts from the initial word2vec training, the alternative is never to train a separate word2vec model on just the test texts, but rather to re-use the frozen word2vec model from the training data to vectorize the test texts as well (see the sketch below).
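One hedged sketch of that re-use, assuming the model from the question is kept frozen after training on train_x3 (the doc_vector helper and zero-vector fallback are illustrative choices, not the only option): average the model's word vectors over each review, dropping tokens missing from its vocabulary, and feed the resulting arrays to the downstream classifier.

import numpy as np

def doc_vector(tokens, wv):
    # keep only the tokens the frozen model knows; out-of-vocabulary words are dropped
    known = [t for t in tokens if t in wv]
    if not known:
        # a review with no known words becomes an all-zero vector (an assumed fallback)
        return np.zeros(wv.vector_size)
    return np.mean(wv[known], axis=0)

# featurize BOTH splits with the same model trained earlier on train_x3
train_features = np.vstack([doc_vector(toks, model.wv) for toks in train_x3])
test_features = np.vstack([doc_vector(toks, model.wv) for toks in test_x3])

These two feature matrices can then go directly into a RandomForest or LightGBM classifier together with train_y and test_y.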

Separately: min_count=1 is almost always a bad idea for word2vec models, and if you're tempted to use it, you may have far too little data for such a data-hungry algorithm to show its true value. (If using it on the datasets where it really shines, you should more often be raising that threshold above its default, discarding more rare words, than lowering it to save every rare, hard-to-model-well word.)
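A quick way to sanity-check that trade-off on your own corpus (a hypothetical snippet, not part of the answer above) is to count how many distinct words each candidate min_count value would keep:

from collections import Counter

freq = Counter(tok for review in train_x3 for tok in review)
for mc in (1, 5, 10):
    kept = sum(1 for count in freq.values() if count >= mc)
    print(f"min_count={mc}: {kept} distinct words would survive")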
