Averaging Vectors from a Corpus

Question

How could I use the code below to go through a folder of documents, get each document's vector, and then average those vectors into one overall value?

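import spacy

# Assumes a spaCy model with word vectors is installed, e.g. en_core_web_md.
nlp = spacy.load('en_core_web_md')

documents_list = ['Hello, world', 'Here are two sentences.']
for doc in documents_list:
    doc_nlp = nlp(doc)
    print(doc_nlp.vector)
    for token in doc_nlp:
        print(token.text, token.vector)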

Answer 1

It looks like you want average vectors at the sentence level, but your example also prints a token-level vector representation.

Sentence level

Averaging sentence vectors could be done in the following way:

>>> import numpy as np
>>> np.array([nlp(doc).vector for doc in documents_list]).mean(axis=0)

This returns a single vector, averaged over all sentences in documents_list.
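Since the question asks about a folder of documents rather than an in-memory list, here is a minimal sketch of how the same averaging might be applied to files on disk. The function name average_folder_vector, the embed parameter, and the "*.txt" pattern are assumptions made for illustration; embed stands in for something like `lambda text: nlp(text).vector` with a loaded spaCy pipeline.

```python
from pathlib import Path

import numpy as np


def average_folder_vector(folder, embed, pattern="*.txt"):
    """Average embed(text) over every file in `folder` matching `pattern`.

    `embed` is a hypothetical callable mapping a string to a 1-D vector,
    for example `lambda text: nlp(text).vector`.
    """
    vectors = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        vectors.append(np.asarray(embed(text), dtype=float))
    if not vectors:
        raise ValueError(f"no files matching {pattern!r} in {folder!r}")
    # Stack to shape (n_documents, n_dimensions), then average over documents.
    return np.stack(vectors).mean(axis=0)
```

With spaCy loaded as in the question, a call might look like `average_folder_vector("my_docs", lambda text: nlp(text).vector)`, where "my_docs" is a placeholder folder name.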

Token level

To average at the token level instead, take the mean of the token vectors within each document:

>>> [np.array([token.vector for token in nlp(doc)]).mean(axis=0) for doc in documents_list]

This gives a list of len(documents_list) vectors, one per sentence, each being the mean of that sentence's token vectors.
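It may help to see how the two levels interact when you want one overall value. As far as the spaCy documentation describes it, Doc.vector defaults to the average of the token vectors, so averaging document vectors gives every document equal weight, while averaging over all tokens at once gives longer documents more weight. The sketch below uses made-up 2-D arrays in place of real spaCy vectors to show that the two results can differ.

```python
import numpy as np

# Toy token vectors (made-up numbers): a 1-token and a 3-token document.
doc_a = np.array([[4.0, 0.0]])
doc_b = np.array([[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]])

# Mean of per-document means: each document counts equally.
per_doc_means = [doc_a.mean(axis=0), doc_b.mean(axis=0)]
mean_of_docs = np.mean(per_doc_means, axis=0)   # values: 2.0, 1.0

# Pooled mean over all tokens: longer documents count more.
pooled_mean = np.concatenate([doc_a, doc_b]).mean(axis=0)   # values: 1.0, 1.5

print(mean_of_docs)
print(pooled_mean)
```

Which of the two is appropriate depends on whether a long document should influence the overall average more than a short one.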

Side note

As a side note, averaging vectors does not really preserve semantic structure, because it implicitly claims that the local context is equivalent to its broader context. Concatenating vectors might be a better choice within a small window.

Make sure to test the results for your domain and task; depending on your assumptions, averaging could still work well.
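One way to picture the side note is with a toy example: averaging discards word order, so two windows containing the same words in a different order get the same vector, whereas concatenation keeps them apart. The vectors and the helper names window_average and window_concat below are invented for illustration and are not real embeddings or spaCy functions.

```python
import numpy as np

# Made-up 2-D vectors standing in for real word embeddings.
toy_vectors = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([1.0, 1.0]),
}


def window_average(tokens):
    """Mean of the token vectors; the order of tokens does not matter."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)


def window_concat(tokens):
    """Token vectors joined end to end; the order of tokens is preserved."""
    return np.concatenate([toy_vectors[t] for t in tokens])


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

print(np.allclose(window_average(a), window_average(b)))   # True
print(np.array_equal(window_concat(a), window_concat(b)))  # False
```

The trade-off is that a concatenated vector grows with the window size, which is why it tends to suit small, fixed-size windows.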

Answer 2

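I'm not sure what the documents mean here (I'm not familiar with spaCy), but if you want an average, you can append each vector to a list and then, after the for loop, do: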

avg = sum(vectors_list) / len(vectors_list)
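A small sketch of that loop, assuming the vectors are NumPy arrays (which is what spaCy's .vector attributes return). The two stand-in arrays are made up. Note that sum() only works here because NumPy arrays add element by element; with plain Python lists it would raise a TypeError.

```python
import numpy as np

vectors_list = []
# Stand-in vectors; in the question these would be doc_nlp.vector values.
for vector in (np.array([1.0, 2.0]), np.array([3.0, 4.0])):
    vectors_list.append(vector)

# Element-wise sum of the arrays, divided by how many there are.
avg = sum(vectors_list) / len(vectors_list)
print(avg)  # values: 2.0, 3.0
```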
