Machine Learning Tools Docker Image Size Issue

I need a Docker container with the following packages installed for some computational analysis. The packages listed below are in the requirements.txt file.

boto3 = "*"
nltk ="*"
pandas = "*"
scikit-learn = "*"
sentence_transformers = "*"
spacy = {extras = ["lookups"],version = "*"}
streamlit = "*"
tensorflow = "*"
unidecode = "*"

I have written a Dockerfile for this. The issue I am facing is the size of the Docker image, which is around 6 GB (6.42 GB exactly). Can anybody help me reduce the size of the Docker image?

Here is the Dockerfile:

FROM python:3.7-slim-buster as base

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

COPY . /opt/program

WORKDIR /opt/program/

RUN chmod +x train

# Install dependencies
RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get autoremove -y \
    && apt-get install -y \
    gcc \
    build-essential \
    zlib1g-dev \
    wget \
    unzip \
    cmake \
    python3-dev \
    gfortran \
    libblas-dev \
    liblapack-dev \
    libatlas-base-dev \
    && apt-get clean

# Install Python packages
RUN pip install --upgrade pip \
    && pip install \
    ipython[all] \
    nose \
    matplotlib \
    pandas \
    scipy \
    sympy \
    && rm -fr /root/.cache

RUN pip install --install-option="--prefix=/install" -r requirements.txt

You are installing a lot into that image, so it will be fairly large regardless, but there are a few things you can do about it.

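The minor one: remove /var/lib/apt/lists/* once you are done installing packages via apt. This has to happen in the same RUN instruction as the install, because files deleted in a later instruction still take up space in the earlier layer. Append it to the end of your apt-get RUN instruction (adding a trailing \ after apt-get clean):

    && rm -rf /var/lib/apt/lists/*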

The major one: judging from the contents of the Dockerfile, I guess it is used to train a model, which requires training data. That data can take up a lot of space, since you are copying everything into the image. The data does not need to be present in the image; it only needs to be available in the container created from the image.

Instead of copying everything into the image, copy only the files needed to run the logic and load the data some other way. One way is to bind mount the data into the container. You could store the data in a separate folder, say ./data, and list this folder in your .dockerignore file (so that it is not copied over). Then, depending on how you launch the container, you can specify the bind mount, for example (an absolute host path is used here because older Docker versions do not accept a relative path for -v):

docker container run -v "$(pwd)/data":/<path-inside-image> ...

Replace <path-inside-image> with the path where the data should be located, but be careful not to mount onto a directory that already holds essential files, since those will be hidden by the mounted folder.

If a bind mount is not a viable solution for you, then you will need to find another way to load the data into the container, for example pulling it from the internet or from network-attached storage once the container is running.
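The advice above could be sketched roughly as follows. This is only an illustration under assumptions: train and requirements.txt come from the question, while the src/ folder and the /opt/ml/data mount point are hypothetical names chosen for the example.

```dockerfile
FROM python:3.7-slim-buster

WORKDIR /opt/program

# Copy only what is needed to install the dependencies.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy only the code. "train" is the script from the question;
# "src/" is a hypothetical folder holding the rest of the code.
COPY train .
COPY src/ ./src/
RUN chmod +x train

# No training data is copied. It is expected to be bind mounted at run
# time onto a hypothetical path such as /opt/ml/data, e.g.
#   docker container run -v "$(pwd)/data":/opt/ml/data ...
```

Because the data never enters the build, the image size no longer depends on how much training data you have.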

Here are some methods collected from other people's Dockerfiles and from documentation:

  • Delete the apt cache

Run rm -rf /var/lib/apt/lists/* in the same RUN instruction as apt-get install, for example:

RUN apt-get update && apt-get install -y \
        ca-certificates \
        netbase \
    && rm -rf /var/lib/apt/lists/*

Not like this, because files deleted in a separate RUN still exist in the earlier layer:

RUN apt-get update && apt-get install -y \
      ca-certificates \
      netbase
RUN rm -rf /var/lib/apt/lists/*

  • Use --no-install-recommends

RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        netbase \
    && rm -rf /var/lib/apt/lists/*

--no-install-recommends means: do not install recommended (non-essential) packages alongside the requested ones.

  • Remove intermediate (build-only) software

e.g.:

RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        g++ \
    && pip install cython \
    && apt-get remove -y gcc g++ \
    && rm -rf /var/lib/apt/lists/*

Some software, such as gcc, is needed only while installing other packages, so we can remove it once the installation has finished.
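A related technique, not proposed in the answer itself, is a multi-stage build: the compilers live in a throwaway build stage and only the installed packages are copied into the final image. The sketch below is a minimal illustration, assuming the official python images (which keep their interpreter under /usr/local) and a valid requirements.txt; the stage name builder and the /install prefix are arbitrary choices.

```dockerfile
# Build stage: contains the compilers and is discarded after the build.
FROM python:3.7-slim-buster AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        g++ \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
# Install into a separate prefix so everything can be copied out as one directory.
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Final stage: no compilers, only the installed packages.
FROM python:3.7-slim-buster
# Merge the installed packages into the interpreter's own prefix.
COPY --from=builder /install /usr/local
```

If a package needs a shared library at run time, that library still has to be installed in the final stage.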

  • Use pip without a cache

e.g.:

RUN pip install --no-cache-dir -r requirements.txt

  • Download and remove rather than COPY?

I am not sure about this one. In other people's Dockerfiles, a file is downloaded, used, and then deleted within a single RUN instruction, rather than copied into the image with COPY. A likely reason is that a file added by COPY stays in its own layer even if a later instruction deletes it.
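As a rough sketch of that pattern, the instruction below downloads an archive, unpacks it, and deletes the archive and the download tools again, all in one layer. The URL and the target directory are hypothetical placeholders.

```dockerfile
# Hypothetical URL and paths, for illustration only.
RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        wget \
        unzip \
    && wget -q -O /tmp/resources.zip https://example.com/resources.zip \
    && unzip -q /tmp/resources.zip -d /opt/program/resources \
    # The archive and the tools are removed in the same RUN,
    # so they never end up in a stored layer.
    && rm /tmp/resources.zip \
    && apt-get purge -y --auto-remove wget unzip \
    && rm -rf /var/lib/apt/lists/*
```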

  • Do not bake model data into the image

If you use TensorFlow or another AI application, you may have model data that is a few GB in size. A better way is to download it when the container runs (via FTP, object storage, or some other means) or to mount it. Do not put it in the image.

  • Take care with the .git folder

This is just from my experience. If you use git for version control, the .git folder may be very large. The command COPY . /XXX will copy .git into the image, so find a way to filter out .git. This is what I use:


FROM alpine:3.12 as MID
COPY XXX /XXX/
COPY ... /XXX/

FROM image:youneed
COPY --from=MID /XXX/ /XXX/ 
RUN apt-get update && xxxxx

CMD ["python","app.py"]

Or use .dockerignore.

In your Dockerfile:

# Are wget, cmake, and the others really necessary?

COPY . /opt/program

WORKDIR /opt/program/

# Install dependencies
RUN chmod +x train && apt-get update \
    && apt-get upgrade -y \
    && apt-get autoremove -y \
    && apt-get install -y \
    gcc \
    build-essential \
    zlib1g-dev \
    wget \
    unzip \
    cmake \
    python3-dev \
    gfortran \
    libblas-dev \
    liblapack-dev \
    libatlas-base-dev \
    && apt-get clean && pip install --upgrade pip \
    && pip install --no-cache-dir \
    ipython[all] \
    nose \
    matplotlib \
    pandas \
    scipy \
    sympy \
    && pip install --no-cache-dir --install-option="--prefix=/install" -r requirements.txt \
    # Just an experiment: find out which packages can safely be removed.
    && apt-get remove -y gcc unzip cmake \
    && rm -rf /var/lib/apt/lists/* \
    && rm -fr /root/.cache

Of course, this way you may get a somewhat smaller image, but the build will make little use of Docker's layer cache. So while you are still experimenting to find out which software can be removed, split the work into two or three RUN commands to take more advantage of the cache.

Hope this helps.
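While experimenting, a split layout along these lines may help. It is only a sketch: the package list is shortened, and copying requirements.txt before the rest of the code is an extra assumption so that code changes do not invalidate the cached pip layer.

```dockerfile
FROM python:3.7-slim-buster
WORKDIR /opt/program

# Step 1: system packages. Cached until this instruction changes.
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        build-essential \
    && rm -rf /var/lib/apt/lists/*

# Step 2: Python packages. Cached until requirements.txt changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Step 3: application code. Changes here do not re-run steps 1 and 2.
COPY . /opt/program
```

Once you know which packages can go, merge the install and remove steps back into one RUN, since a removal in a later RUN does not shrink the earlier layer.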
