How to run word2vec on Windows using gensim

A couple of years ago, a previous developer on my team wrote the following Python code, which calls word2vec, passing in a training file and the location of an output file. He worked on Linux. I have been asked to get this running on a Windows machine. Bearing in mind that I know next to no Python, I have installed Gensim, which I'm guessing implements word2vec, but I do not know how to rewrite the code to use the library rather than the executable, which does not seem possible to compile on a Windows box. Could someone help me update this code, please?

#!/usr/bin/env python3

import os
import csv
import subprocess
import shutil

from gensim.models import word2vec

def train_word2vec(trainFile, output):
    # run word2vec:
    subprocess.run(["word2vec", "-train", trainFile, "-output", output,
                    "-cbow", "0", "-window", "10", "-size", "100"],
                   shell=False)
    # Remove some invalid unicode:
    with open(output, 'rb') as input_,\
         open('%s.new' % output, 'w') as new_output:
        for line in input_:
            try:
                print(line.decode('utf-8'), file=new_output, end='')
            except UnicodeDecodeError:
                print(line)
                pass
    shutil.move('%s.new' % output, output)

def main():
    train_word2vec("c:/temp/wc/test1_BigF.txt", "c:/temp/wc/test1_w2v_model.txt")

if __name__ == '__main__':
    main()

I think the core of what you're after looks something like this:

import sys

from gensim.models.word2vec import Word2Vec

def train_word2vec(trainFile, output):
    # build a list of tokens for each line (sentence) of the training file
    with open(trainFile) as f:
        sentences = [line.split() for line in f]

    # effective executable invocation of original code (included for reference)
    # word2vec -train {trainFile} -output {output} -cbow 0 -window 10 -size 100

    # invocation via word2vec module with (mostly) equivalent params
    # (gensim 4.0 renamed size to vector_size; on gensim 3.x pass size=100 instead)
    model = Word2Vec(sentences, vector_size=100, window=10, min_count=1, workers=4)

    # save generated model
    model.save(output)

if __name__ == '__main__':
    train_word2vec(sys.argv[1], sys.argv[2])

Save as train.py and invoke as follows:

python train.py input.txt output.txt
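One thing to watch with the script above: it reads the whole training file into a list, which may not fit in memory for a large corpus. Word2Vec iterates over its input more than once (once to build the vocabulary, then once per training epoch), so a one-shot generator is not enough; it needs an iterable that can be restarted. Below is a minimal sketch of such an iterable. The class name LineCorpus is made up for illustration, and the UTF-8 encoding with errors="replace" is an assumption about the training file. gensim also ships its own gensim.models.word2vec.LineSentence, which serves the same purpose.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable: yields one list of tokens per non-empty line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding  # assumption: the corpus is UTF-8 text

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated
        # several times. errors="replace" substitutes invalid bytes rather
        # than raising UnicodeDecodeError.
        with open(self.path, encoding=self.encoding, errors="replace") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("the quick brown fox\n\njumps over the lazy dog\n")
    corpus = LineCorpus(path)
    first_pass = list(corpus)
    second_pass = list(corpus)  # a second pass yields the same sentences

print(first_pass)
```

An instance of this class could be passed to Word2Vec in place of the sentences list.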

A few things to note:

  • The module (word2vec) and the imported class (Word2Vec) differ only in capitalisation. The import will break if you mix them up.
  • I've not included an equivalent for the command line -cbow 0 argument. It selects the Skip-gram algorithm over CBOW. gensim exposes the same choice as the sg parameter of Word2Vec (sg=1 for Skip-gram, sg=0, the default, for CBOW), so the call above trains CBOW unless sg=1 is added.
  • Nor have I included (or attempted to reproduce) the Unicode removal logic of the original. model.save() writes largely binary data, so running that logic on its output 'as is' (a) falls over pretty much straight away and (b) leaves me rather in the dark as to what it is trying to achieve.
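For what it's worth, here is a rough sketch of how the original command line could be mapped onto gensim more closely. It assumes gensim 4.x parameter names (vector_size rather than size). The helper names cli_flags_to_gensim_kwargs and train_and_save_text are invented for this example. The -cbow 0 flag corresponds to sg=1 (Skip-gram), and model.wv.save_word2vec_format(..., binary=False) writes the plain-text vector format that the original executable produced by default, which is the kind of file the original Unicode clean-up loop was reading. gensim is imported inside the training function, so the flag mapping can be tried without it installed.

```python
def cli_flags_to_gensim_kwargs(flags):
    """Translate a dict of word2vec command-line flags to Word2Vec kwargs.

    Only the three flags used in the original script are handled.
    """
    kwargs = {}
    if "-size" in flags:
        kwargs["vector_size"] = int(flags["-size"])  # named `size` before gensim 4.0
    if "-window" in flags:
        kwargs["window"] = int(flags["-window"])
    if "-cbow" in flags:
        # word2vec: -cbow 1 is CBOW, -cbow 0 is Skip-gram
        # gensim:   sg=0 is CBOW,    sg=1 is Skip-gram
        kwargs["sg"] = 0 if int(flags["-cbow"]) else 1
    return kwargs


def train_and_save_text(train_file, output, **kwargs):
    """Train on a one-sentence-per-line file and save text-format vectors."""
    # Imported here so the mapping above works without gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(sentences=LineSentence(train_file), **kwargs)
    # Plain-text output: a header line, then one "word v1 v2 ..." line per word.
    model.wv.save_word2vec_format(output, binary=False)
    return model


kwargs = cli_flags_to_gensim_kwargs({"-cbow": "0", "-window": "10", "-size": "100"})
print(kwargs)
```

Calling train_and_save_text(trainFile, output, **kwargs) would then stand in for the subprocess call. Other defaults (min_count, sampling, number of epochs) may still differ between the executable and gensim, so the resulting vectors should not be expected to match exactly.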

Hope this helps a little anyway.

First things first: either you posted incomplete code, or your script is missing the following part, which lets it take arguments from the command line (add it at the bottom of the script):

if __name__ == '__main__':
    import sys
    train_word2vec(sys.argv[1], sys.argv[2])

Then run the script (Python is interpreted, not compiled) from a command line, in approximately the following way:

python.exe your_script_file.py pathToInput pathToOutput
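If either path is left off, sys.argv[1] or sys.argv[2] raises a bare IndexError. As an optional alternative, the standard library's argparse module prints a usage message instead. The sketch below is only one way to do it; the function name parse_args and the argument names train_file and output are chosen for illustration, and the paths in the demonstration are the ones from the question.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional paths; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Train word2vec vectors with gensim.")
    parser.add_argument("train_file", help="path to the training text file")
    parser.add_argument("output", help="path where the model is written")
    return parser.parse_args(argv)


# Demonstration with an explicit argument list; in the real script, call
# parse_args() with no arguments so that the command line is read instead.
args = parse_args(["c:/temp/wc/test1_BigF.txt", "c:/temp/wc/test1_w2v_model.txt"])
print(args.train_file, args.output)
```

In the script itself, the last step would be train_word2vec(args.train_file, args.output).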
