
How to prepare data for word2vec in gensim and fasttext?

I want to train word2vec and fasttext to get vectors for a specific dataset that I have.

What should my model take as input?

My file is like this:

Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec
Agent_9: Sure
Customer_4: When is the least expensive to fly

Now, how should I prepare my data for word2vec to run? Does the word2vec model take inter-sentence similarity into account, i.e. should I not prepare the corpus sentence by sentence?

One way is to first split your document into lines, then split each line into tokens. You end up with a corpus that is a list of lists of tokens, one inner list per line. You can then feed that into the gensim word2vec model.
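Here is a minimal sketch of that approach, using the sample dialogue from the question. A few things in it are my own assumptions rather than anything gensim requires: the `prepare_corpus` and `train_models` helper names, the regex that strips speaker labels such as `Customer_4:`, the lowercasing, and the hyperparameter values. It also assumes gensim 4.x, where the dimension argument is `vector_size` (older releases call it `size`). Since gensim treats each inner list as one sentence, context windows stay within a line and do not cross into the next one.

```python
import re

RAW = """\
Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec
Agent_9: Sure
Customer_4: When is the least expensive to fly
"""

# Assumed pattern for speaker labels like "Customer_4: "; adjust to your file.
SPEAKER = re.compile(r"^\s*\w+:\s*")
# Very simple tokenizer: runs of lowercase letters, digits and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def prepare_corpus(text):
    """Turn raw dialogue text into a list of token lists, one per line."""
    corpus = []
    for line in text.splitlines():
        utterance = SPEAKER.sub("", line).lower()
        tokens = TOKEN.findall(utterance)
        if tokens:  # skip blank lines
            corpus.append(tokens)
    return corpus


def train_models(corpus):
    """Train Word2Vec and FastText on the prepared corpus (gensim 4.x)."""
    # Imported here so the preparation step works without gensim installed.
    from gensim.models import FastText, Word2Vec

    # min_count=1 only because the sample is tiny; raise it for real data.
    w2v = Word2Vec(sentences=corpus, vector_size=50, window=5,
                   min_count=1, workers=1)
    ft = FastText(sentences=corpus, vector_size=50, window=5,
                  min_count=1, workers=1)
    return w2v, ft


corpus = prepare_corpus(RAW)
print(corpus[0])
```

With gensim installed, `w2v, ft = train_models(corpus)` gives you both models, and `w2v.wv["ticket"]` returns the vector for a word. Whether to keep the speaker labels as tokens is up to you; I dropped them here because they are not part of the utterance.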
