
How can a Word2Vec pretrained model be loaded in Gensim faster?

I'm loading the model using:

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) 

Now every time I run the file in PyCharm, it loads the model again.

So, is there a way to load it once and have it available whenever I run things like model['king'] and model.doesnt_match("house garage store dog".split())?

It takes a lot of time whenever I want to check similarity or find words that don't match. When I ran model.most_similar('finance') it was really slow and the whole laptop froze for about 2 minutes. So, is there a way to make things faster? I want to use it in my project, but I can't let the user wait this long.

Any suggestions?

That's a set of word-vectors that's about 3.6GB on disk, and slightly larger when loaded - so just the disk IO can take a noticeable amount of time.

Also, at least until gensim-4.0.0 (now available as a beta preview), versions of Gensim through 3.8.3 require an extra one-time pre-calculation of unit-length-normalized vectors upon the very first use of a .most_similar() or .doesnt_match() operation (& others). This step can also take a noticeable moment, & then immediately requires a few extra GB of memory for a full model like GoogleNews - which on any machine with less than about 8GB of RAM free risks using slower virtual-memory or even crashing with an out-of-memory error. (Starting in gensim-4.0.0beta, once the model loads, the 1st .most_similar() won't need any extra pre-calculation/allocation.)
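If you're staying on Gensim 3.x for now, one way to at least control when that one-time cost hits is to trigger the normalization explicitly, right after loading, rather than on a user's first query. A minimal sketch, assuming Gensim 3.8.x (the method is deprecated in 4.0, which no longer needs it):

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Gensim 3.x only: do the one-time unit-normalization now, at load time.
# replace=True overwrites the raw vectors so two copies aren't kept in RAM.
model.init_sims(replace=True)
model.most_similar('finance', topn=3)  # no longer pays the normalization cost

Note that with replace=True only the normalized vectors remain, which is fine for similarity/doesnt_match operations but discards the original vector magnitudes.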

The main way to avoid this annoying lag is to structure your code or service to not reload it separately before each calculation. Typically, this means keeping an interactive Python process that's loaded it alive, ready for your extra operations (or later user requests, as might be the case with a web-deployed service).

It sounds like you may be developing a single Python script, something like mystuff.py, and running it via PyCharm's execute/debug/etc utilities for launching a Python file. Unfortunately, upon each completed execution, that will let the whole Python process end, releasing any loaded data/objects completely. Running the script again must do all the loading/precalculation again.

If your main interest is doing a bit of investigational examination & experimentation with the set of word-vectors, on your own, a big improvement would be to move to an interactive environment that keeps a single Python run alive & waiting for your next line of code.

For example, if you run the ipython interpreter at a command-line, in a separate shell, you can load the model, do a few lookup/similarity operations to print the results, and then just leave the prompt waiting for your next code. The full loaded state of the process remains available until you choose to exit the interpreter.
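For instance, a session might look like this (just a sketch of the workflow; the In [n]: prompts are IPython's):

In [1]: from gensim.models import KeyedVectors
In [2]: model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)  # slow, but only once
In [3]: model.most_similar('finance', topn=3)
In [4]: model.doesnt_match("house garage store dog".split())
# leave the prompt open: later queries reuse the already-loaded model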

Similarly, if you use a Jupyter Notebook inside a web-browser, you get that same interpreter experience inside a growing set of editable-code-and-result 'cells' that you can re-run. All are sharing the same back-end interpreter process, with persistent state - unless you choose to restart the 'kernel'.

If you're providing a script or library code for your users' investigational work, they could also use such persistent interpreters.

But if you're building a web service or other persistently-running tool, you'd similarly want to make sure that the model remains loaded between user requests. (Exactly how you'd do that would depend on the details of your deployment, including web server software, so it'd be best to ask/search-for that as a separate question supplying more details when you're at that step.)
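Just to illustrate the pattern (the framework & endpoint here are my own assumptions, not a recommendation), the key point is loading the model once at process startup, so every request reuses the same in-memory vectors:

from flask import Flask, jsonify
from gensim.models import KeyedVectors

app = Flask(__name__)
# Loaded once when the server process starts, then shared by all requests.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

@app.route('/similar/<word>')
def similar(word):
    # Reuses the in-memory vectors; no per-request reload.
    return jsonify([(w, float(score)) for w, score in model.most_similar(word, topn=5)])

How you'd actually run & scale such a process (workers, preloading, etc.) depends on your server setup, per the caveat above.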

There is one other trick that may help in your constant-relaunch scenario. Gensim can save & load in its own native format, which can make use of 'memory-mapping'. Essentially, a range of a file on-disk can be used directly by the operating-system's virtual memory system. Then, when many processes all designate the same file as the canonical version of something they want in their own memory-space, the OS knows they can re-use any parts of that file that are already in memory.

This technique works far more simply in gensim-4.0.0beta and later, so I'm only going to describe the steps needed there. (See this message if you want to force this preview installation before Gensim 4.0 is officially released.)

First, load the original-format file, but then re-save it in Gensim's format:

from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True) 
kv_model.save('GoogleNews-vectors-negative300.kv')

Note that there will be an extra .npy file created that must be kept alongside GoogleNews-vectors-negative300.kv if you move the model elsewhere. Do this only once to create the new files.
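If you want to check what the re-save produced, a quick way (assuming the save ran in the current directory) is to list the matching files - you should see the small .kv file plus the large sidecar .npy array:

import glob
print(glob.glob('GoogleNews-vectors-negative300.kv*'))  # the .kv file and its *.npy sidecar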

Second, when you later need the model, use Gensim's .load() with the mmap option:

kv_model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
# do your other operations

Right away, the .load() should complete faster. However, when you 1st try to access any word - or all words in a .most_similar() - the read from disk will still need to happen, just shifting the delays to later. (If you're only ever doing individual-word lookups or small sets of .doesnt_match() words, you may not notice any long lags.)
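A quick way to observe that shift (a sketch; exact timings vary with your machine & whether the OS already has the file cached):

import time
from gensim.models import KeyedVectors

t0 = time.perf_counter()
kv_model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
print('load:', time.perf_counter() - t0)  # fast: just maps the file

t0 = time.perf_counter()
kv_model.most_similar('finance', topn=3)  # forces the full array to page in
print('1st query:', time.perf_counter() - t0)

t0 = time.perf_counter()
kv_model.most_similar('finance', topn=3)  # already resident now
print('2nd query:', time.perf_counter() - t0)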

Further, depending on your OS & amount-of-RAM, you might even get some speedup when you run your script once, let it finish, then run it again soon after. It's possible in some cases that even though the OS has ended the prior process, its virtual-memory machinery remembers that some of the not-yet-cleared old-process memory pages are still in RAM, & correspond to the memory-mapped file. Thus, the next memory-map will re-use them. (I'm not sure of this effect, and if you're in a low-memory situation the chance of such re-use from a completed process may disappear completely.)

But, you could increase the chances of the model file staying memory-resident by taking a third step: launch a separate Python process that preloads the model and doesn't exit until killed. To do this, make another Python script like preload.py:

from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-negative300.kv', mmap='r')
model.most_similar('stuff')  # any word will do: just to page all in
Semaphore(0).acquire()  # just hang until process killed

Run this script in a separate shell: python preload.py. It will map the model into memory, but then hang until you CTRL-C exit it.

Now, any other code you run on the same machine that memory-maps the same file will automatically re-use any already-loaded memory pages from this separate process. (In low-memory conditions, if any other virtual-memory is being relied upon, ranges could still be flushed out of RAM. But if you have plentiful RAM, this will ensure minimal disk IO each new time the same file is referenced.)

Finally, one other option that can be mixed with any of these is to load only a subset of the full 3-million-token, 3.6GB GoogleNews set. The less-common words are near the end of this file, and skipping them won't affect many uses. So you can use the limit argument of load_word2vec_format() to only load a subset - which loads faster, uses less memory, and completes later full-set searches (like .most_similar()) faster. For example, to load just the 1st 1,000,000 words for about 67% savings of RAM/load-time/search-time:

from gensim.models import KeyedVectors
kv_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',  limit=1000000, binary=True) 
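To confirm the truncated load, you can check the vocabulary size afterward (a sketch; in Gensim 4 len(kv_model) works, while Gensim 3.x exposes it as len(kv_model.vocab)):

print(len(kv_model))  # expect 1000000 with the limit above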
