How to use vectors from Doc2Vec in Tensorflow

I am trying to use Doc2Vec to convert sentences to vectors, then use those vectors to train a TensorFlow classifier.

I am a little confused about what tags are used for, and how to extract all of the document vectors from Doc2Vec after it has finished training.

My code so far is as follows:

import gensim
import pandas as pd
from gensim.models.doc2vec import TaggedDocument

fake_data = pd.read_csv('./sentences/fake.txt', sep='\n')
real_data = pd.read_csv('./sentences/real.txt', sep='\n')
sentences = []

for i, row in fake_data.iterrows():
    sentences.append(TaggedDocument(row['title'].lower().split(), ['fake', len(sentences)]))

for i, row in real_data.iterrows():
    sentences.append(TaggedDocument(row['title'].lower().split(), ['real', len(sentences)]))

model = gensim.models.Doc2Vec(sentences)

I get vectors when I do print(model.docvecs[1]) etc., but they are different every time I rebuild the model.

First of all: have I used Doc2Vec correctly? Second: is there a way I can grab all documents tagged 'real' or 'fake', then turn them into a NumPy array and pass it into TensorFlow?

I believe the tags you use for each TaggedDocument do not do what you expect. The Doc2Vec algorithm learns one vector per tag, and a tag may be shared between documents. With your tags, 'fake' and 'real' each get a single vector shared by every document of that class, in addition to one vector per integer index. So if your goal is simply to convert sentences to vectors, the recommended choice of tag is a unique sentence identifier, such as the sentence index.

The learned vectors are then stored in model.docvecs (renamed model.dv in gensim 4.0 and later). E.g., if you use the sentence index as a tag, you can get the first document vector by accessing model.docvecs with the tag "0", the second with the tag "1", and so on.

Example code:

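from gensim.models import doc2vec

# `sentences` is assumed to be a list of token lists, e.g. [['some', 'title'], ...]
documents = [doc2vec.TaggedDocument(sentence, ['real-%d' % i])
             for i, sentence in enumerate(sentences)]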
model = doc2vec.Doc2Vec(documents, vector_size=10)  # 10 is just for illustration

# Raw vectors are stored in `model.docvecs.vectors_docs`.
# It's easier to access each one by its tag; the tags are stored in `model.docvecs.doctags`.
for tag in model.docvecs.doctags.keys():
    print(tag, model.docvecs[tag])  # Prints the learned numpy array for this tag
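
The snippet above tags only one class. A minimal sketch of extending the same 'real-%d' pattern to both classes might look like the following. The helper name tag_titles and the sample titles are made up for illustration, and plain (tokens, tag) pairs are used so the sketch runs without gensim installed.

```python
def tag_titles(fake_titles, real_titles):
    """Return (tokens, tag) pairs with one unique tag per title, e.g. 'fake-0'."""
    tagged = []
    for prefix, titles in (("fake", fake_titles), ("real", real_titles)):
        for i, title in enumerate(titles):
            # The class name is only a prefix; the index keeps every tag unique.
            tagged.append((title.lower().split(), "%s-%d" % (prefix, i)))
    return tagged


# Made-up titles, for illustration only.
tagged = tag_titles(
    ["Aliens Land In Ohio"],
    ["Council Approves Budget", "Rain Expected Friday"],
)
for tokens, tag in tagged:
    print(tag, tokens)
```

Each pair can then be wrapped as doc2vec.TaggedDocument(tokens, [tag]) before training, so that every sentence gets its own vector while the class can still be read back from the tag.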

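By the way, to control the model's randomness, use the seed parameter of the Doc2Vec class. For fully reproducible vectors, gensim also requires training with a single worker thread (workers=1) and, on Python 3, a fixed PYTHONHASHSEED environment variable.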
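
For the second part of the question (getting a NumPy array to feed a classifier), one possible approach is to look up each tag and stack the results. This is only a sketch under a few assumptions: tags carry a 'fake-' or 'real-' prefix, the label convention is 1 for real and 0 for fake, and a plain dict with made-up values stands in for the trained model. The helper vectors_for_tags is hypothetical; it only needs an object that can be indexed by tag, which model.docvecs supports.

```python
import numpy as np


def vectors_for_tags(lookup, tags):
    """Stack lookup[tag] for each tag into a (len(tags), dim) float32 array."""
    return np.stack([np.asarray(lookup[tag], dtype=np.float32) for tag in tags])


# Stand-in for a trained model's tag-to-vector lookup (values are made up).
lookup = {
    "fake-0": [0.1, 0.2, 0.3],
    "fake-1": [0.0, -0.1, 0.4],
    "real-0": [0.5, 0.5, -0.2],
}

tags = sorted(lookup)  # any fixed order works, as long as X and y share it
X = vectors_for_tags(lookup, tags)
# Assumed label convention: 1 = real, 0 = fake, derived from the tag prefix.
y = np.array([1 if tag.startswith("real-") else 0 for tag in tags], dtype=np.int64)

print(X.shape, y.tolist())
```

With a trained model, passing model.docvecs as lookup and the list of tags used during training should give the feature matrix X and label vector y, which can be handed to TensorFlow as ordinary NumPy arrays.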
