Pythonic way to obtain a distance matrix from word vectors in gensim 4.0

Question

I am currently using gensim version 4.0.1 to generate word vectors. My ultimate goal is to compute cosine distances between all pairwise combinations of word vectors and to use the resulting distance matrix for clustering the word vectors. So far I have been generating the distance matrix with the following code:

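    import gensim
    from sklearn.metrics import pairwise_distances

    print('Setting up Word2Vec model')
    # Note: passing the corpus to the constructor already builds the vocabulary and trains once
    model = gensim.models.Word2Vec(genome_tokens, vector_size=100, window=args.window_size, min_count=args.min_cluster_size, workers=args.threads, sg=1)

    print('Training Word2Vec model')
    # This call runs 10 further epochs on top of the training done by the constructor
    model.train(genome_tokens, total_examples=len(genome_tokens), epochs=10)

    # Collect the vectors in alphabetical word order
    words = sorted(model.wv.index_to_key)
    scaled_data = [model.wv[w] for w in words]
    print('Calculating distribution distance among clusters')
    cluster_distrib_distance = pairwise_distances(scaled_data, metric=args.metric)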

I was wondering if there is a specific function to obtain the distance matrix directly from the model object, without having to create the words and scaled_data objects.

Going through the gensim documentation, I have mostly found information on ways to calculate similarities rather than distances, and often between documents rather than individual words. There does seem to be some discussion on this topic on the GitHub repository, but the methods described there seem to be specific to the older versions, as is the case for the solution presented here.

Answer 1

There's no built-in utility method for that.

But you can get the raw backing array, which holds all the vectors, from the model.wv.vectors property. Each row is the word-vector for the word at the same position in index_to_key.

You can feed this into sklearn.metrics.pairwise_distances (or similar) directly, without the need for the separate (& differently-sorted) scaled_data outside.

Note that if using something like Euclidean distance, you might want the word-vectors to be unit-length-normalized before calculating distances. Then all distances will be in the range [0.0, 2.0], and ranked distances will be the exact reverse of ranked cosine-similarities.

In that case you'd again want to work from an external set of vectors – either by using get_vector(key, norm=True) to get them 1-by-1, or get_normed_vectors() to get a fully unit-normed version of the .vectors array.
