
Mocking FastText model for utest

I am using fasttext models in my Python library (from the official fasttext library). To run my unit tests, I need at some point a model (a fasttext.FastText._FastText object) that is as light as possible, so that I can version it in my repo.

I have tried to create a fake text dataset "fake.txt" with 5 lines and a few words, and called:

import fasttext
import fasttext.util

model = fasttext.train_unsupervised("./fake.txt")  # train on the tiny fake corpus
fasttext.util.reduce_model(model, 2)               # shrink vector dimension to 2
model.save_model("fake_model.bin")

It basically works, but the model is 16 MB. That is more or less acceptable for a unit-test resource, but do you think I can go below this?

Note that FastText (and similar dense word-vector models) don't perform meaningfully when trained on toy-sized data or with toy-sized parameters. (All their useful/predictable/testable benefits depend on large, varied datasets and the subtle arrangements of many final vectors.)

But if you just need a relatively meaningless object/file of the right type, your approach should work. The main parameter that makes a FastText model large regardless of the tiny training set is the bucket parameter, with a default value of 2,000,000. It allocates that many character-n-gram (word-fragment) slots, even if all your actual words don't produce that many n-grams.

Setting bucket to a far smaller value at initial model creation should make your plug/stand-in file far smaller as well.
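For example, a minimal sketch of that idea, building on the snippet in the question (the specific dim, bucket, and minCount values here are arbitrary illustrative choices for a tiny stand-in model, not recommendations):

import fasttext

# Train on the tiny fake corpus with a drastically reduced hash-bucket count
# and a small vector dimension; both values are illustrative only.
model = fasttext.train_unsupervised(
    "./fake.txt",
    dim=10,       # small vector dimension
    bucket=100,   # far below the 2,000,000 default
    minCount=1,   # keep every word from the tiny corpus
)
model.save_model("tiny_fake_model.bin")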
