How to access/use Google's pre-trained Word2Vec model without manually downloading the model?

I want to analyse some text on a Google Compute server on Google Cloud Platform (GCP) using the Word2Vec model.

However, the uncompressed word2vec model from https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ is over 3.5GB, and downloading it manually and then uploading it to a cloud instance takes time.

Is there any way to access this (or any other) pre-trained Word2Vec model on a Google Compute server without uploading it myself?

You can use Gensim to download pre-trained models through its downloader API:

import gensim.downloader as api

# Download the model (if not already cached) and return the local file path.
path = api.load("word2vec-google-news-300", return_path=True)
print(path)

or from the command line:

python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)

For a list of available datasets, see: https://github.com/RaRe-Technologies/gensim-data
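
If you omit return_path=True, api.load returns the vectors already loaded into memory as a KeyedVectors object, so you can query them directly. A minimal sketch:

import gensim.downloader as api

# Load the vectors into memory instead of just fetching the file
# (the first call downloads ~1.6GB and caches it under ~/gensim-data).
model = api.load("word2vec-google-news-300")
print(model.most_similar("king", topn=3))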

As an alternative to downloading the file manually, you can use a pre-packaged version (third-party, not from Google) hosted as a Kaggle dataset.

First, sign up for Kaggle and get your API credentials: https://github.com/Kaggle/kaggle-api#api-credentials

Then, do this on the command line:

pip3 install kaggle
# Store the API credentials where the Kaggle CLI expects them.
mkdir -p $HOME/.kaggle/
echo '{"username":"****","key":"****"}' > $HOME/.kaggle/kaggle.json
chmod 600 $HOME/.kaggle/kaggle.json
# Download and unpack the dataset into $HOME/content/ (matching the paths used below).
kaggle datasets download alvations/vegetables-google-word2vec -p $HOME/content
unzip $HOME/content/vegetables-google-word2vec.zip -d $HOME/content

Finally, in Python:

import os

import numpy as np

home = os.environ["HOME"]
# The .npy file holds the embedding matrix; the .txt file lists one token per
# line, in the same row order as the rows of the matrix.
embeddings = np.load(os.path.join(home, 'content/word2vec.news.negative-sample.300d.npy'))
with open(os.path.join(home, 'content/word2vec.news.negative-sample.300d.txt')) as fp:
    tokens = [line.strip() for line in fp]
# Look up the 300-dimensional vector for a word.
print(embeddings[tokens.index('hello')])
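
For a quick sanity check on the loaded arrays, a small cosine-similarity lookup can be built on top of the tokens and embeddings variables from the snippet above (the most_similar helper below is just illustrative, not part of the dataset):

import numpy as np

def most_similar(word, topn=5):
    # Cosine similarity between the query vector and every row of the matrix.
    v = embeddings[tokens.index(word)]
    sims = embeddings @ v / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v))
    best = np.argsort(-sims)[1:topn + 1]  # index 0 is the query word itself
    return [(tokens[i], float(sims[i])) for i in best]

print(most_similar('hello'))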

Full example on Colab: https://colab.research.google.com/drive/178WunB1413VE2SHe5d5gc0pqAd5v6Cpl


P.S.: For other pre-packaged word embeddings, see https://github.com/alvations/vegetables

The following code will do the job on Colab (or any other Jupyter notebook) in about 10 seconds:

# The first request captures Google Drive's "can't scan for viruses" confirmation token.
result = !wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p'
code = result[-1]
# The second request replays the cookies plus the confirm token to fetch the actual file.
arg = ' --load-cookies cookies.txt "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" -O GoogleNews-vectors-negative300.bin.gz' % code
!wget $arg

If you need this inside a Python script, replace the wget calls with the requests library:

import re
import shutil

import requests

# The first request returns an HTML page containing the confirmation token.
url1 = 'https://docs.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM'
resp = requests.get(url1)
code = re.findall('.*confirm=([0-9A-Za-z_]+).*', str(resp.content))
# The second request replays the session cookies plus the token and streams the file to disk.
url2 = "https://docs.google.com/uc?export=download&confirm=%s&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" % code[0]
with requests.get(url2, stream=True, cookies=resp.cookies) as r:
    with open('GoogleNews-vectors-negative300.bin.gz', 'wb') as f:
        shutil.copyfileobj(r.raw, f)
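
Once the download finishes, there is no need to decompress the archive by hand: Gensim's load_word2vec_format reads the gzipped binary directly. A minimal sketch:

from gensim.models import KeyedVectors

# Load the binary-format vectors straight from the .gz archive.
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)
print(model['hello'].shape)  # (300,)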
