

How to use GitLab cache to store model weights for an ML pipeline?

Original post: 2021-03-25 03:30:05 · Tags: python, unit-testing, machine-learning, continuous-integration, gitlab

I am using GitLab to host a Python machine-learning pipeline. The pipeline includes trained weights for a model, which I do not want to store in git. The weights are stored in a remote data store that the pipeline automatically pulls from when running its job.

This works, but I have a problem when trying to run end-to-end automated CI tests with this setup. I do not want to download the model weights from the remote every time my CI is triggered (since that can get expensive). In fact, I want to completely block internet access within all CI tests for security reasons (for example, by configuring socket in my conftest.py).

If I do this, I obviously cannot access the location where my model weights are stored. I know I can mock the model's output for testing, but I actually want to test whether the model's weights are sensible. So mocking is out of the question.

I posted a similar question before, and one of the suggested solutions was to take advantage of GitLab's caching mechanism to store the model weights.

However, I cannot figure out exactly how to do that. From what I understand of caching, if I enable it, GitLab will download the necessary files from the internet once and reuse them in later pipelines. However, the solution I am looking for would look something like this:

1. Upload a file to GitLab manually.
2. This file is accessible to all my CI jobs, but it is not tracked by git.
3. When the file becomes outdated (because I created a new model), I manually upload the updated file.

With the cache workflow, from what I understand, if I want to update the file, I would have to enable the internet in the test suite, have the pipeline automatically download the new set of weights, and then disable the internet again once the new cache is set up. This feels hacky and unsafe (unsafe because I never want to enable the internet during testing).
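From the test suite's side, the manual-upload workflow could look roughly like the sketch below: the tests read the weights from a local path and fail loudly instead of downloading anything. The environment variable name `MODEL_WEIGHTS_PATH` and the idea of pinning a SHA-256 digest in the repository to detect an outdated file are assumptions made for illustration, not part of the original setup.

```python
import hashlib
import os
from pathlib import Path

# Hypothetical variable name: the CI job would set it to wherever the
# manually uploaded weights file ends up on the runner.
WEIGHTS_ENV_VAR = "MODEL_WEIGHTS_PATH"


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large weight files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def locate_weights(expected_sha256: str) -> Path:
    """Return the local weights file, or raise instead of downloading."""
    raw = os.environ.get(WEIGHTS_ENV_VAR)
    if not raw:
        raise FileNotFoundError(f"{WEIGHTS_ENV_VAR} is not set")
    path = Path(raw)
    if not path.is_file():
        raise FileNotFoundError(f"no weights file at {path}")
    actual = sha256_of(path)
    if actual != expected_sha256:
        # A mismatch means the uploaded file is outdated or corrupted.
        raise ValueError(f"unexpected sha256 for {path}: {actual}")
    return path
```

With the expected digest committed next to the tests, replacing the model would mean uploading the new file and updating one string in the repository, without the tests ever touching the network.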

Is there a good solution to this problem?

1 Answer

One possible solution, though it may not be flexible enough, is to keep the model file in GitLab CI variables and put it at the correct path in a job step. GitLab CI supports binary files as variables as well.
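A minimal sketch of the "put it at the correct path" step, under stated assumptions: the weights are stored base64-encoded in a File-type CI variable, so the environment variable holds the path of a temporary file containing the encoded text. The variable name and destination path in the usage note are hypothetical. CI variables are size-limited, so this is likely only practical for small weight files.

```python
import base64
import os
from pathlib import Path


def materialize_weights(var_name: str, destination: Path) -> Path:
    """Decode a base64 weights file referenced by a CI variable.

    Assumes a File-type variable: the environment variable contains the
    path of a file whose content is the base64-encoded weights.
    """
    encoded_file = Path(os.environ[var_name])
    raw = base64.b64decode(encoded_file.read_text())
    # Create the target directory if the job has not done so already.
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(raw)
    return destination
```

A job step could then call something like `materialize_weights("MODEL_WEIGHTS_B64", Path("models/weights.bin"))` before the tests run, so the test suite itself never needs network access.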