Averaging Vectors from a Corpus

How could I use the code below to go through a folder of documents, get each document's vector, and then average those vectors into one overall value?

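import spacy

# Assumption: any spaCy pipeline that ships word vectors works; en_core_web_md is one such model.
nlp = spacy.load("en_core_web_md")

documents_list = ['Hello, world', 'Here are two sentences.']
for doc in documents_list:
    doc_nlp = nlp(doc)
    print(doc_nlp.vector)  # document-level vector
    for token in doc_nlp:
        print(token.text, token.vector)  # token-level vector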

It seems you want to average vectors at the sentence level, but your example also prints token-level vectors.

Sentence level

Averaging sentence vectors could be done in the following way:

>>> import numpy as np
>>> np.array([nlp(doc).vector for doc in documents_list]).mean(axis=0)

This returns a single vector: the average over all sentences in documents_list.
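To apply the same idea to a folder of documents, as the question asks, one option is to read each file, vectorize it, and average the results. The following is a minimal sketch rather than a definitive implementation: the function name `average_folder_vector`, the `*.txt` pattern, and the `embed` parameter are assumptions made for illustration. `embed` is any callable that maps a string to a vector.

```python
from pathlib import Path

import numpy as np


def average_folder_vector(folder, embed, pattern="*.txt"):
    """Return (per_file_vectors, overall_mean) for the text files in `folder`.

    `embed` is a hypothetical callable mapping a string to a 1-D vector,
    e.g. `lambda text: nlp(text).vector` with a spaCy pipeline.
    """
    vectors = {}
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        vectors[path.name] = np.asarray(embed(text), dtype=float)
    if not vectors:
        raise ValueError(f"no files matching {pattern!r} in {folder}")
    # Stack to shape (n_files, dim) and average over the file axis.
    overall = np.stack(list(vectors.values())).mean(axis=0)
    return vectors, overall
```

With spaCy, the call might look like `average_folder_vector("docs/", lambda text: nlp(text).vector)`, where `docs/` is a placeholder path.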

Token level

You could achieve the same on a token level by doing the following:

>>> [np.array([token.vector for token in nlp(doc)]).mean(axis=0) for doc in documents_list]

This gives you a list of averaged word vectors, one per sentence: that is, a list containing len(documents_list) vectors.
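When sentences differ in length, averaging these per-sentence means is not the same as averaging every token in the corpus at once: the first weights each sentence equally, the second weights each token equally. A small sketch of the difference, using made-up 2-dimensional arrays as stand-ins for token vectors (the values and variable names are illustrative assumptions, not spaCy output):

```python
import numpy as np

# Hypothetical token vectors: one array of shape (n_tokens, dim) per sentence.
token_vectors = [
    np.array([[1.0, 0.0], [3.0, 0.0]]),                          # 2 tokens
    np.array([[0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [0.0, 8.0]]),  # 4 tokens
]

# One mean vector per sentence, as in the list comprehension above.
per_sentence = [vecs.mean(axis=0) for vecs in token_vectors]

# Mean of the sentence means: every sentence counts equally.
mean_of_means = np.mean(per_sentence, axis=0)

# Mean over all tokens pooled together: every token counts equally.
pooled_mean = np.concatenate(token_vectors).mean(axis=0)

print(mean_of_means)
print(pooled_mean)
```

Which of the two is appropriate depends on whether long documents should have more influence on the overall vector.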

Side note

As a side note, averaging vectors does not really preserve semantic structure, because it implicitly claims that the local context is equivalent to its broader context. Concatenating might be a better choice in a smaller windowed context.
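One concrete way averaging loses structure is that it ignores word order, whereas concatenation over a small window keeps it. The sketch below illustrates this with made-up 2-dimensional vectors; the values and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 2-dimensional vectors for three tokens (illustrative values only).
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0])

window_1 = [a, b, c]
window_2 = [c, b, a]  # same tokens, different order

# Averaging discards order: both windows map to the same vector.
mean_1 = np.mean(window_1, axis=0)
mean_2 = np.mean(window_2, axis=0)
print(np.allclose(mean_1, mean_2))

# Concatenation keeps position information, at the cost of a vector
# whose length grows with the window size (here 3 * 2 = 6).
concat_1 = np.concatenate(window_1)
concat_2 = np.concatenate(window_2)
print(np.array_equal(concat_1, concat_2))
```

The trade-off is dimensionality: the concatenated vector only has a fixed size if the window size is fixed.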

Make sure to test the results for your domain and task. Depending on your assumptions, averaging could still work well.

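I'm not sure what "documents" means here (I'm not familiar with spaCy), but if you want an average, you can append each vector to a list and then, after the for loop, compute: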

avg = sum(vectors_list) / len(vectors_list)
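If the collection is large, the vectors do not all have to be kept in a list; a running sum gives the same average. A minimal sketch, assuming all vectors have the same length (the helper name `running_mean` and the sample values are made up for illustration):

```python
import numpy as np


def running_mean(vectors):
    """Average an iterable of equal-length vectors without storing them all."""
    total = None
    count = 0
    for vec in vectors:
        vec = np.asarray(vec, dtype=float)
        total = vec.copy() if total is None else total + vec
        count += 1
    if count == 0:
        raise ValueError("cannot average an empty collection of vectors")
    return total / count


# Made-up vectors standing in for document vectors.
vectors_list = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
avg = running_mean(iter(vectors_list))
```

Because `running_mean` accepts any iterable, it can also be fed a generator that yields one document vector at a time.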
