
Converting word2vec output into dataframe for sklearn

I am attempting to use gensim's word2vec to transform a column of a pandas dataframe into a vector that I can pass to a sklearn classifier to make a prediction.

I understand that I need to average the vectors for each row. I have tried following this guide but I am stuck, as I am getting models back but I don't think I can access the underlying embeddings to find the averages.

Please see a minimal, reproducible example below:

import pandas as pd, numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer

temp_df = pd.DataFrame.from_dict({'ID': [1,2,3,4,5], 'ContData': [np.random.randint(1, 10 + 1)]*5, 
                                'Text': ['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit.', 'Sed elementum ultricies varius.',
                                         'Nunc vel risus sed ligula ultrices maximus id qui', 'Pellentesque pellentesque sodales purus,'],
                                'Class': [1,0,1,0,1]})
temp_df['text_lists'] = [x.split(' ') for x in temp_df['Text']]

w2v_model = Word2Vec(temp_df['text_lists'].values, min_count=1)

cv = CountVectorizer()
count_model = pd.DataFrame(data=cv.fit_transform(temp_df['Text']).todense(), columns=list(cv.get_feature_names_out()))

Using sklearn's CountVectorizer, I am able to get a simple frequency representation that I can pass to a classifier. How can I get that same format using Word2vec?

This toy example produces:

adipiscing  amet    consectetur dolor   elementum   elit    id  ipsum   ligula  lorem   ... purus   qui risus   sed sit sodales ultrices    ultricies   varius  vel
0   0   1   0   1   0   0   0   1   0   1   ... 0   0   0   0   1   0   0   0   0   0
1   1   0   1   0   0   1   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
2   0   0   0   0   1   0   0   0   0   0   ... 0   0   0   1   0   0   0   1   1   0
3   0   0   0   0   0   0   1   0   1   0   ... 0   1   1   1   0   0   1   0   0   1
4   0   0   0   0   0   0   0   0   0   0   ... 1   0   0   0   0   1   0   0   0   0

While this runs without error, I cannot access the underlying embeddings in a format I can pass along. I would like to produce the same format as above, except that instead of counts, the values are the word2vec embeddings.

While you might not be able to help it if your original data comes from a Pandas DataFrame, neither Gensim nor Scikit-Learn works with DataFrame-style data natively. Rather, they tend to use raw numpy arrays, or base Python data structures like lists or iterable sequences.

Trying to shoehorn interim raw vectors into the Pandas style of data structure tends to add code complication & wasteful overhead.

That's especially true if the vectors are dense vectors, where essentially all of a smaller number of dimensions are nonzero, as in word2vec-like algorithms. But it's also true for the kind of sparse vectors, with a giant number of dimensions but most of them 0, that come from CountVectorizer and various "bag-of-words"-style text models.
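
To see that contrast concretely with the question's own objects (an illustrative sketch only; cv and w2v_model are the variables from the question's code, and 'Lorem' is just one word from the toy data):

X_sparse = cv.fit_transform(temp_df['Text'])   # scipy sparse matrix: many columns, mostly zeros
print(type(X_sparse), X_sparse.shape)

dense_vec = w2v_model.wv['Lorem']              # plain numpy array: 100 dimensions, essentially all nonzero
print(type(dense_vec), dense_vec.shape)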

So first, I'd recommend against putting the raw outputs of Word2Vec or CountVectorizer, which are usually interim representations on the way to completing some other task, into a DataFrame.

If you want to have the final assigned labels in the DataFrame, for analysis or reporting in the Pandas style, only add those final outputs at the end. But to understand the interim vector representations, and then to pass them to things like Scikit-Learn classifiers in the formats those classes expect, keep those vectors (and inspect them yourself, for clarity) in their raw numpy vector formats.

In particular, after Word2Vec runs (with the parameters you've shown), there'll be a 100-dimensional vector per word. Not per multi-word text. And the 100 dimensions have no names other than their indexes 0 to 99.
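
A quick check against the toy model above makes this concrete (a sketch; 'Lorem' is again just a word from the toy data):

vec = w2v_model.wv['Lorem']   # one vector per word in the vocabulary, not per text
print(vec.shape)              # (100,): the default vector_size
print(vec[:5])                # a few of the unnamed float dimensions, indexed 0 to 99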

And unlike the dimensions of the CountVectorizer representation, which are counts of individual words, each dimension of the "dense embedding" will be some floating-point value with no clear or specific interpretation on its own: it's only directions/neighborhoods in the whole space, shearing across many dimensions, that vaguely correspond with useful or human-nameable concepts.

If you want to turn the per-word 100-dimensional vectors into vectors for a multi-word text, there are many potential ways to do so, but one simple choice is to simply average the N word-vectors together into 1 summary vector. Gensim's class holding the word-vectors inside the Word2Vec model, KeyedVectors, has a .get_mean_vector() method that can help. For example:

texts_as_wordlists = [x.split(' ') for x in temp_df['Text']]
text_vectors = [w2v_model.wv.get_mean_vector(wordlist) for wordlist in texts_as_wordlists]
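
Those per-text mean vectors can then be stacked into a plain numpy array and handed straight to a Scikit-Learn estimator. A minimal sketch, using LogisticRegression as a stand-in for whatever classifier you intend (and keeping in mind this toy corpus is far too small to learn anything real):

from sklearn.linear_model import LogisticRegression

X = np.vstack(text_vectors)             # shape (5, 100): one row per text
y = temp_df['Class'].values
clf = LogisticRegression().fit(X, y)    # any sklearn classifier accepts this array format
print(clf.predict(X))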

There are many other potential ways to use word-vectors to model a longer text. For example, you might reweight the words before averaging. But a simple average is a reasonable first baseline approach. (Other algorithms related to word2vec, like the 'Paragraph Vector' algorithm implemented by the Doc2Vec class, can also create a vector for a multi-word text, and such a vector is not just the average of its word-vectors.)
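
Since the question already imports Doc2Vec and TaggedDocument, here is a rough sketch of that alternative (the parameters are illustrative only, and min_count=1 is used solely so the toy corpus trains at all; see the notes below):

tagged_docs = [TaggedDocument(words=wordlist, tags=[i])
               for i, wordlist in enumerate(texts_as_wordlists)]
d2v_model = Doc2Vec(tagged_docs, vector_size=100, min_count=1, epochs=20)
doc_vectors = [d2v_model.dv[i] for i in range(len(tagged_docs))]   # one learned vector per text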

Two other notes on using Word2Vec:

  • word2vec vectors only get good when trained on lots of word-usage data. Toy-sized examples trained on only hundreds, or even tens-of-thousands, of words rarely show anything useful, or anything resembling the power of this algorithm on larger datasets.
  • min_count=1 is essentially always a bad idea with this algorithm. Related to the point above, the algorithm needs multiple subtly-contrasting usage examples of any word to have any chance of placing it meaningfully in the shared coordinate space. Words with just one, or even a few, usages tend to get awful vectors, not generalizable to the word's real meaning as would be evident from a larger sample of its use. And, in natural-language corpora, such few-example words are very numerous, so they wind up taking a lot of the training time, and achieving their bad representations actually worsens the vectors for surrounding words, which could otherwise be better because they have enough training examples. So, the best practice with word2vec is usually to ignore the rarest words, training as if they weren't even there. (The class's default is min_count=5 for good reasons, and if that results in your model missing vectors for words you think you need, get more data showing uses of those words in real contexts, rather than lowering min_count.) A quick way to sanity-check this is sketched below.
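
A small sketch against the toy model above, where every token occurs exactly once, so min_count=1 keeps all of them:

print(len(w2v_model.wv))                        # vocabulary size: every one-off token was kept
print(sorted(w2v_model.wv.key_to_index)[:10])   # includes single-use tokens like 'Lorem' and 'elit.'

Note that with the default min_count=5, this toy corpus would leave the vocabulary empty and training would fail, which is a signal that the data is too small, not that min_count should be lowered.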
