How to save word vectors in spacy

Question

I have the following code. The goal is to get a vector representation of each word in the list. I intend to use these word vectors in other applications, such as word clustering.

import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize
import en_vectors_web_lg
nlp = en_vectors_web_lg.load() 

def vectorize(text):
    return nlp(text, disable=['parser', 'tagger', 'ner']).vector

category=['Dell','Python','Asus','Apple','C','perl','Java','iphone','nokia','LG','Lenovo']
for ntext in category:
    doc = nlp(ntext)

    vectors = normalize(np.stack(vectorize(t) for t in doc.text))

I realize I am doing something wrong in the code above. How can I save the word vector of each word in the list 'category'?

Answer 1

I haven't seen much documentation on using the en_vectors_web_lg model, but I do know that en_core_web_lg comes with vectors along with other functionality.

This is how you can vectorize each word/term in a list:

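import spacy

# en_core_web_lg ships with 300-dimensional word vectors
nlp = spacy.load('en_core_web_lg')

category = ['Dell','Python','Asus','Apple','C','perl','Java','iphone','nokia','LG','Lenovo']
# nlp.pipe yields one Doc per input string; the disabled components are not needed for vectors
docs = list(nlp.pipe(category, disable=['parser', 'tagger', 'ner']))
# Doc.vector averages the token vectors, so for a one-word Doc it is that word's vector
vectors = [doc.vector for doc in docs]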

Each vector has 300 dimensions and looks like this:

[-0.94557    0.46092    0.43141   -0.52199    0.55764    0.18107
  0.45607    0.031909   0.097713   0.061064   0.061381  -0.37256
 -0.21712   -0.065784  -0.4061    -0.11485   -0.48388    1.5697
  ...
  0.03717   -0.6773    -0.19379    0.31747   -0.19495    0.37144  ]
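To actually save the vectors, one option is to write them to a plain-text file with one word per line followed by its components (the same layout that GloVe text files use). The following is a minimal sketch, not part of spaCy: `save_vectors`, `load_vectors` and the file name are hypothetical helpers introduced here, the toy 3-d vectors are made-up numbers standing in for the real 300-d ones, and it assumes that no word contains whitespace.

```python
import os
import tempfile


def save_vectors(words, vectors, path):
    """Write one line per word: the word, then its vector components."""
    with open(path, "w", encoding="utf-8") as f:
        for word, vec in zip(words, vectors):
            # repr(float(x)) round-trips exactly, also for NumPy float32 values
            values = " ".join(repr(float(x)) for x in vec)
            f.write(f"{word} {values}\n")


def load_vectors(path):
    """Read the file back into a {word: [float, ...]} dictionary."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            table[word] = [float(v) for v in values]
    return table


# Toy 3-d vectors standing in for the 300-d spaCy ones (made-up numbers).
words = ["Dell", "Python"]
vectors = [[-0.5, 0.25, 1.0], [0.125, -2.0, 0.75]]

path = os.path.join(tempfile.gettempdir(), "category_vectors.txt")
save_vectors(words, vectors, path)
restored = load_vectors(path)
print(restored["Dell"])  # [-0.5, 0.25, 1.0]
```

With the spaCy code above, the call would be `save_vectors(category, vectors, path)`. If the vectors are only needed for numeric work such as clustering, stacking them with `np.stack(vectors)` and writing the array with `np.save` is a common alternative.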

You might also be interested in vector_norm: the L2 norm of the token's vector (the square root of the sum of the squared values).

For example, the vector norm for 'dell' is 8.001050178690836.

spaCy also has a built-in cosine similarity method .similarity() to compare vectors.
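To make the two quantities concrete, here is a small pure-Python sketch of what an L2 norm and a cosine similarity compute. It is an illustration only, not spaCy's implementation: `l2_norm` and `cosine_similarity` are hypothetical helper names, the input vectors are toy values, and returning 0.0 for a zero vector is a choice made here.

```python
import math


def l2_norm(vec):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = l2_norm(a) * l2_norm(b)
    # Guard against division by zero when either vector is all zeros
    return dot / denom if denom else 0.0


print(l2_norm([3.0, 4.0]))                        # 5.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

In practice, `doc.vector_norm` and `doc1.similarity(doc2)` give these values directly for spaCy objects, so helpers like these are only needed when working with vectors that have been saved and reloaded outside spaCy.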

solution1
1 ACCEPTED 2020-08-12 22:10:01