Mocking a FastText model for unit tests
I am using FastText models in my Python library (from the official fasttext library). To run my unit tests, I need at some point a model (a fasttext.FastText._FastText object) that is as light as possible, so that I can version it in my repo.
I have tried to create a fake text dataset, "fake.txt", with 5 lines and a few words, and called:
model = fasttext.train_unsupervised("./fake.txt")
fasttext.util.reduce_model(model, 2)
model.save_model("fake_model.bin")
It basically works, but the model is 16 MB. That is more or less acceptable for a unit-test resource, but do you think I can go below this?
Note that FastText (and similar dense word-vector models) don't perform meaningfully when trained with toy-sized data or parameters. (All their useful/predictable/testable benefits depend on large, varied datasets and the subtle arrangements of many final vectors.)
But if you just need a relatively meaningless object/file of the right type, your approach should work. The main parameter that makes a FastText model large regardless of the tiny training set is the bucket parameter, with a default value of 2,000,000. It allocates that many character-ngram (word-fragment) slots, even if your actual words don't produce anywhere near that many ngrams.
Setting bucket to some far smaller value at initial model creation should make your plug/stand-in file far smaller as well.