
How to extract vectors from a Word2Vec Model for clustering

I have trained a word2vec model on a corpus of ~95,000 words. I would like to select only those words in the corpus that are, for example, adjectives. For this, let's assume I have adj, a list of all adjectives in the corpus. I want them to keep their embeddings from the overall corpus, but I want to extract these vectors and perform some sort of cluster analysis on just the adjectives.

From what I understand, if I have X, which holds the vectors for a word2vec model's vocabulary, I can extract the vectors of all adjectives like so:

adj = [ 'x', 'y', 'z']
X = model1[model1.wv.vocab]
adjvsm = []
for i in adj:
    adjvsm.append([i, X[i]])

This will create the following list:

adjvsm[0]
['x', array([ 1.0772455 ,  0.481113  , -0.19076753, -0.31512445,  2.700769], dtype=float32)]

Normally, if I want to cluster a word2vec model, I'd do the following:

import nltk
from nltk.cluster import KMeansClusterer

kclusterer = KMeansClusterer(some_number_of_cluster, distance=nltk.cluster.util.cosine_distance, repeats=25)

assigned_clusters = kclusterer.cluster(X, assign_clusters=True)
print(assigned_clusters)

If I specified two clusters, this would produce assignments like the following:

x = 1
y = 0
z = 0

Of course, this doesn't work. One problem I can find is that I'm pulling vectors out of a numpy array and putting them into a list, which the k-means clusterer can't use; it expects numpy arrays.

My question is: how do I extract a set of vectors (based on a list of word IDs) from a word2vec model while keeping them in a numpy array and keeping the link between each word ID (e.g. 'y') and its embedding?

You already know the answer. Build a numpy array.

import numpy as np
X = np.array([model1.wv[word] for word in adj])  # row i is the vector for adj[i]

You may even be able to do simply:

X = model1.wv[adj]
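
A minimal sketch of the stacking step, assuming a plain dict of toy vectors as a stand-in for the trained model (the names toy_vectors and words are hypothetical, and the numbers are made up for illustration). With a real model, the lookup would go through the model's own word vectors instead. It also shows one way to skip adjectives that have no vector:

```python
import numpy as np

# Hypothetical stand-in for the trained model's word vectors; any mapping
# from word to 1-D float array behaves the same way for this sketch.
toy_vectors = {
    "red": np.array([1.0, 0.0, 0.5], dtype=np.float32),
    "blue": np.array([0.9, 0.1, 0.4], dtype=np.float32),
    "quick": np.array([-0.2, 1.0, 0.0], dtype=np.float32),
    "run": np.array([0.0, 0.3, 1.0], dtype=np.float32),
}

adj = ["red", "blue", "quick", "slow"]  # "slow" is not in the toy vocabulary

# Keep only the adjectives that actually have a vector, preserving order.
words = [w for w in adj if w in toy_vectors]

# Stack the vectors into one 2-D array; row i belongs to words[i].
X = np.array([toy_vectors[w] for w in words])
```

Because words and X are built in the same order, the position in the list is the link between a word and its embedding.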

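Building mixed-data lists as you did is inefficient. Keep the words and the vectors in two parallel structures instead: row i of X is the vector for adj[i], so the list order preserves the link between word and embedding.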
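
To get from cluster labels back to words, it should be enough to pair the labels with the word list in the same order. A small sketch, assuming the clusterer returns one label per row; the labels below are the illustrative ones from the question, not real output, and word_to_cluster and clusters are hypothetical names:

```python
from collections import defaultdict

words = ["x", "y", "z"]          # same order as the rows passed to the clusterer
assigned_clusters = [1, 0, 0]    # illustrative labels, one per row

# Word -> cluster label lookup.
word_to_cluster = dict(zip(words, assigned_clusters))

# Cluster label -> list of member words.
clusters = defaultdict(list)
for word, label in zip(words, assigned_clusters):
    clusters[label].append(word)
```

This only works if the word list is not reordered or filtered after the array is built.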


 