
Pythonic way to obtain a distance matrix from word vectors in gensim 4.0

I am currently using gensim version 4.0.1 to generate word vectors. My ultimate goal is to compute cosine distances between all pairwise combinations of word vectors and to use the resulting distance matrix for clustering the word vectors. So far I have been generating the distance matrix with the following code:

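    import gensim
    from sklearn.metrics import pairwise_distances

    print('Setting up Word2Vec model')
    # Passing the corpus to the constructor already builds the vocabulary and trains the model.
    model = gensim.models.Word2Vec(genome_tokens, vector_size=100, window=args.window_size, min_count=args.min_cluster_size, workers=args.threads, sg=1)

    print('Training Word2Vec model')
    # Note: this runs further training epochs on top of the constructor's training.
    model.train(genome_tokens, total_examples=len(genome_tokens), epochs=10)

    # Collect the vectors in alphabetical key order (not the model's own order).
    words = sorted(model.wv.index_to_key)
    scaled_data = [model.wv[w] for w in words]
    print('Calculating distribution distance among clusters')
    cluster_distrib_distance = pairwise_distances(scaled_data, metric=args.metric)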

I was wondering if there is a specific function to obtain the distance matrix directly from the model object, without having to create the words and scaled_data objects.

Going through the gensim documentation, I have mostly found information on ways to calculate similarities rather than distances, and often between documents rather than individual words. There does seem to be some discussion of this topic on the GitHub repository, but the methods described there seem to be specific to older versions, as is the case for the solution presented here.

There's no built-in utility method for that.

But you can get the raw backing array, with all the vectors in it, from the model.wv.vectors property. Each row is the word-vector for the word at the same position in index_to_key.

You can feed this into sklearn.metrics.pairwise_distances (or similar) directly, without needing the separate (and differently sorted) scaled_data list.

Note that if using something like Euclidean distance, you might want the word-vectors to be unit-length-normalized before calculating distances. Then all distances will be in the range [0.0, 2.0], and ranked distances will be the exact reverse of ranked cosine-similarities.
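The reason is that for unit-length vectors a and b, ||a - b||^2 = 2 - 2·cos(a, b), so Euclidean distance decreases monotonically as cosine similarity increases. The following NumPy-only sketch checks this on random vectors, which are a stand-in for real word-vectors and not gensim output:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))  # hypothetical stand-in for word-vectors

# Scale every row to unit length.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just the dot product.
cosine_sim = unit @ unit.T

# Euclidean distance between every pair of unit vectors.
diff = unit[:, None, :] - unit[None, :, :]
euclidean = np.linalg.norm(diff, axis=2)

# Identity for unit vectors: squared distance = 2 - 2 * cosine similarity.
identity_holds = np.allclose(euclidean ** 2, 2.0 - 2.0 * cosine_sim)
print(identity_holds, euclidean.min(), euclidean.max())
```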

In that case you'd again want to work from an external set of vectors – either by using get_vector(key, norm=True) to get them 1-by-1, or get_normed_vectors() to get a fully unit-normed version of the .vectors array.
