How to use GitLab cache to store model weights for an ML pipeline?

Original post 2021-03-25 03:30:05 6 1 python/ unit-testing/ machine-learning/ continuous-integration/ gitlab

I am using GitLab to host a Python machine learning pipeline. The pipeline includes the trained weights of a model, which I do not want to store in git. The weights are stored in remote data storage that the pipeline automatically pulls from when running its job.

This works, but I have a problem when trying to run end-to-end automatic CI tests with this setup. I do not want to download the model weights from the remote every time my CI is triggered (since that can get expensive). In fact, I want to completely block the internet connection within all CI tests for security reasons (for example, by configuring socket in my conftest.py).

If I do this, I obviously cannot access the location where my model weights are stored. I know I can mock the result of the model for testing, but I actually want to test whether the weights of the model are sensible. So mocking is out of the question.

I posted a similar question before, and one of the solutions I got was to take advantage of GitLab's caching mechanism to store the model weights.

However, I am not able to figure out how to do that exactly. From what I understand of caching, if I enable it, GitLab will download the necessary files from the internet once and reuse them in later pipelines. However, the solution that I am looking for would look something like this:

1. Upload a file to gitlab manually.
2. This file is accessible to all my CI jobs, however, this is not tracked by git.
3. When the file becomes outdated (because I created a new model), I manually upload the updated file.

With the cache workflow, from what I understand, if I want to update the file, I will have to enable the internet in the testing suite, have the pipeline automatically download the new set of weights, and then disable the internet again once the new cache is set up. This feels hacky and unsafe (unsafe, because I never want to enable internet during testing).

Is there a good solution for this problem?

1 answer

One possible solution, though it may not be flexible enough, is to keep the model file in GitLab CI/CD variables and write it to the correct path in a job step. GitLab CI supports file-type variables as well.
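A minimal sketch of the "put it into the correct path" step follows. It assumes the weights were base64-encoded and saved in a CI/CD variable, which the job sees as an environment variable. The variable name `MODEL_WEIGHTS_B64`, the function `materialize_weights` and the target path are all hypothetical. As far as I know, CI/CD variable values are size-limited, so this probably only suits small weight files.

```python
import base64
import os
from pathlib import Path


def materialize_weights(var_name: str, target: Path) -> Path:
    """Decode a base64 environment variable and write it to `target`.

    `var_name` is a hypothetical CI/CD variable holding the encoded weights.
    """
    encoded = os.environ.get(var_name)
    if encoded is None:
        raise RuntimeError(f"Environment variable {var_name!r} is not set")

    # Remove whitespace or line wrapping that may have been added on upload.
    compact = "".join(encoded.split())
    data = base64.b64decode(compact, validate=True)

    # Create the parent directory if needed, then write the raw bytes.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

A test setup step could call, for example, `materialize_weights("MODEL_WEIGHTS_B64", Path("models/weights.bin"))` before the tests load the model, so no network access is needed at test time.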

solution1 1 ACCEPTED 2021-03-26 07:28:57
