
Mocking FastText model for utest

I am using fasttext models in my Python library (from the official fasttext library). To run my unit tests, I need at some point a model (a fasttext.FastText._FastText object) that is as light as possible, so that I can version it in my repo.

I have tried to create a fake text dataset "fake.txt" with 5 lines and a few words, and called:

import fasttext
import fasttext.util

model = fasttext.train_unsupervised("./fake.txt")  # train on the tiny fake corpus
fasttext.util.reduce_model(model, 2)               # shrink vector dimension to 2
model.save_model("fake_model.bin")

It basically works, but the model is 16 MB. That is more or less acceptable for a unit-test resource, but do you think I can go below this?

Note that FastText (and similar dense word-vector models) don't perform meaningfully when trained on toy-sized data or with toy-sized parameters. (All their useful/predictable/testable benefits depend on large, varied datasets and the subtle arrangements of many final vectors.)

But if you just need a relatively meaningless object/file of the right type, your approach should work. The main parameter that makes a FastText model large regardless of the tiny training set is the bucket parameter, with a default value of 2,000,000. It allocates that many character-n-gram (word-fragment) slots, even if all your actual words don't produce that many n-grams.

Setting bucket to a far smaller value at initial model creation should make your plug/stand-in file far smaller as well.
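For example, a minimal sketch of that idea, building on the snippet in the question (the specific dim, bucket, and minCount values here are arbitrary illustrative choices for a tiny stand-in model, not recommendations):

import fasttext

# Train on the tiny fake corpus with a drastically reduced hash-bucket count
# and a small vector dimension; both values are illustrative only.
model = fasttext.train_unsupervised(
    "./fake.txt",
    dim=10,       # small vector dimension
    bucket=100,   # far below the 2,000,000 default
    minCount=1,   # keep every word from the tiny corpus
)
model.save_model("tiny_fake_model.bin")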
