
Machine Learning Tools Docker Image Size Issue

I need a docker container with the following packages installed on it for some sort of computational analysis. The packages listed below are inside the requirements.txt file.

boto3 = "*"
nltk = "*"
pandas = "*"
scikit-learn = "*"
sentence_transformers = "*"
spacy = {extras = ["lookups"], version = "*"}
streamlit = "*"
tensorflow = "*"
unidecode = "*"

I have written a Dockerfile for this. The problem I am facing is the size of the resulting Docker image, which is around 6 GB (6.42 GB to be exact). Can anybody help me reduce the size of the image?

Here is the Dockerfile:

FROM python:3.7-slim-buster as base

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

COPY . /opt/program

WORKDIR /opt/program/

RUN chmod +x train

# Install dependencies
RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get autoremove -y \
    && apt-get install -y \
    gcc \
    build-essential \
    zlib1g-dev \
    wget \
    unzip \
    cmake \
    python3-dev \
    gfortran \
    libblas-dev \
    liblapack-dev \
    libatlas-base-dev \
    && apt-get clean

# Install Python packages
RUN pip install --upgrade pip \
    && pip install \
    ipython[all] \
    nose \
    matplotlib \
    pandas \
    scipy \
    sympy \
    && rm -fr /root/.cache

RUN pip install --install-option="--prefix=/install" -r requirements.txt

You are installing a lot of software into that image, so it will be fairly large in any case, but there are a few things you can do about it.

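The minor one: remove /var/lib/apt/lists/* once apt has finished installing. The removal has to happen in the same RUN instruction as the apt-get install, because files deleted by a later RUN are still stored in the earlier layer. Append this to the end of the apt-get RUN instruction:

    && rm -rf /var/lib/apt/lists/*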

The major one: judging from the contents of the Dockerfile, I guess it is used to train a model. Training requires training data, which can take up a lot of space because you are copying everything into the image (COPY . /opt/program). The data does not need to be present in the image; it only needs to be made available to the container created from the image.

Instead of copying everything into the image, copy only the files needed to run the logic and load the data some other way. One option is to bind mount the data into the container. You could store the data in a separate folder, say ./data, and list this folder in your .dockerignore file so that it is not copied over. Then, depending on how you launch the container, you can specify the bind mount like this:

docker container run -v "$(pwd)/data":/<path-inside-image> ...

Replace <path-inside-image> with the path where the data should be located, but be careful not to mount onto a directory that already holds essential files, since those will be hidden by the mounted folder.
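As a rough sketch of the .dockerignore entry and the bind mount described above: the image tag ml-tools and the container path /opt/program/data are assumptions made for illustration, not names taken from the question. The script only prints the docker command so it can be reviewed before it is run.

```shell
# Sketch: keep ./data out of the build context and bind mount it at run time.
# Assumed names (not from the question): image tag "ml-tools",
# container path /opt/program/data.

# Append the data folder to .dockerignore so `COPY . /opt/program` skips it.
echo 'data/' >> .dockerignore

# Use an absolute host path; relative host paths in -v are not accepted
# by every Docker version.
host_data="$(pwd)/data"
run_cmd="docker container run --rm -v ${host_data}:/opt/program/data ml-tools"

# Print the command instead of running it, so it can be reviewed first.
echo "$run_cmd"
```

Because data/ is excluded from the image, /opt/program/data does not exist there and the mount hides nothing.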

If a bind mount is not a viable solution for you, you will need to find another way to load the data into the container, for example by pulling it from the internet or from network-attached storage once the container is running.
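Pulling the data once the container is running could look roughly like the helper below. This is only a sketch: the function name fetch_model, the file name data.bin and the idea of passing a URL are all hypothetical, and it assumes wget is installed in the image (the question's Dockerfile does install it).

```shell
# Sketch of a start-up helper: download data only when it is not already there.
# fetch_model, data.bin and the URL argument are hypothetical names.

fetch_model() {
    target_dir="$1"      # where the data should end up inside the container
    source_url="${2:-}"  # where to download it from (may be empty)

    # Skip the download when the directory already has content,
    # for example because it was bind mounted.
    if [ -n "$(ls -A "$target_dir" 2>/dev/null)" ]; then
        echo "model already present in $target_dir"
        return 0
    fi

    # Nothing on disk and nowhere to download from: report the error.
    if [ -z "$source_url" ]; then
        echo "no URL given and $target_dir is empty" >&2
        return 1
    fi

    mkdir -p "$target_dir"
    wget -q -O "$target_dir/data.bin" "$source_url"
}
```

An entrypoint script could call fetch_model with a directory and a URL taken from environment variables, and then hand over to the real command with exec "$@".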

Here are some techniques I picked up from other people's Dockerfiles and from documentation:

  • Delete the apt cache

Run rm -rf /var/lib/apt/lists/* in the same RUN instruction as apt-get install, like this:

RUN apt-get update && apt-get install -y \
        ca-certificates \
        netbase \
    && rm -rf /var/lib/apt/lists/*

rather than in a separate RUN, which leaves the files in the earlier layer:

RUN apt-get update && apt-get install -y \
      ca-certificates \
      netbase
RUN rm -rf /var/lib/apt/lists/*

  • Use --no-install-recommends

RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        netbase \
    && rm -rf /var/lib/apt/lists/*

--no-install-recommends tells apt to install only the hard dependencies and skip the recommended (non-essential) packages.

  • Remove build-only software

e.g.:

RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        g++ \
    && pip install cython \
    && apt-get purge -y --auto-remove gcc g++ \
    && rm -rf /var/lib/apt/lists/*

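Some software, such as gcc, is only needed while other packages are being installed, so it can be removed once the installation has finished. The removal must happen in the same RUN instruction, otherwise the compiler stays in the earlier layer.

  • Run pip without its cache

e.g.:
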
RUN pip install --no-cache-dir -r requirements.txt

  • Download and delete rather than COPY?

I am not sure about this one. In other people's Dockerfiles, files are often downloaded, used, and deleted again within a single RUN instruction instead of being copied in. A file added with COPY stays in its layer even if a later instruction deletes it.

  • Do not bake model data into the image

If you use TensorFlow or another AI framework, you may have model data that is several GB in size. It is better to download it when the container starts (over FTP, from object storage, or some other way) or to mount it, rather than putting it in the image.

  • Watch out for the .git folder

This is just from my own experience. If you use git to manage your code, the .git folder can be very large, and the command COPY . /XXX copies .git into the image. Find a way to filter out .git. This is what I use:

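# Staging stage: copy only the files you need, so .git never reaches the final image
FROM alpine:3.12 AS mid
COPY XXX /XXX/
COPY ... /XXX/

# Final image takes the files from the staging stage
FROM image:youneed
COPY --from=mid /XXX/ /XXX/
RUN apt-get update && xxxxx

CMD ["python","app.py"]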

Or use a .dockerignore file.
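To check whether .git (or anything else) is what bloats a COPY of the whole directory, a small helper like the one below can list the largest top-level entries. This is a sketch: context_sizes is a hypothetical name, and it reports sizes before any .dockerignore rules are applied.

```shell
# Sketch: list the top-level entries of a build context directory by size.
# context_sizes is a hypothetical helper name.

context_sizes() {
    # Size in KiB and name of every top-level entry (hidden ones included),
    # largest first.
    ( cd "$1" && find . -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn )
}

# Show the ten largest entries of the current directory.
context_sizes . | head -n 10
```

Anything large that the application does not need at run time is a candidate for .dockerignore.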

get above from:

Applied to your Dockerfile:

# Are wget, cmake and some of the other packages really necessary?

COPY . /opt/program

WORKDIR /opt/program/

# Install dependencies
RUN chmod +x train && apt-get update \
    && apt-get upgrade -y \
    && apt-get autoremove -y \
    && apt-get install -y \
    gcc \
    build-essential \
    zlib1g-dev \
    wget \
    unzip \
    cmake \
    python3-dev \
    gfortran \
    libblas-dev \
    liblapack-dev \
    libatlas-base-dev \
    && apt-get clean && pip install --upgrade pip \
    && pip install --no-cache-dir \
    ipython[all] \
    nose \
    matplotlib \
    pandas \
    scipy \
    sympy \
    && pip install --no-cache-dir --install-option="--prefix=/install" -r requirements.txt \
    # Just an experiment: find out which packages can be removed.
    && apt-get remove -y gcc unzip cmake \
    && rm -rf /var/lib/apt/lists/* \
    && rm -fr /root/.cache

Of course, merging everything into one RUN this way gives you a smaller image, but any change invalidates the whole layer, so the build cannot reuse Docker's layer cache. While you are still trying to find out which software can be removed, split the instruction into two or three RUN commands to benefit from the cache.

Hope this helps.
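While experimenting, it helps to see how much each instruction contributes to the image. A minimal sketch using docker history follows; layer_sizes is a hypothetical helper name and ml-tools:latest is an assumed image tag.

```shell
# Sketch: print the size of every layer next to the instruction that created it.
# layer_sizes is a hypothetical helper; "ml-tools:latest" is an assumed tag.

layer_sizes() {
    # One line per layer: its size, then the instruction that produced it.
    docker history --no-trunc --format '{{.Size}} {{.CreatedBy}}' "$1"
}

# Usage (requires a local Docker daemon and a built image):
#   layer_sizes ml-tools:latest
```

Comparing the output before and after a change shows whether removing a package actually made its layer smaller.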

