I need a Docker container with the following packages installed for some computational analysis. The packages listed below are in the requirements.txt file.
boto3 = "*"
nltk ="*"
pandas = "*"
scikit-learn = "*"
sentence_transformers = "*"
spacy = {extras = ["lookups"],version = "*"}
streamlit = "*"
tensorflow = "*"
unidecode = "*"
I have written a Dockerfile for this. The issue I am facing is the size of the Docker image, which is around 6 GB (6.42 GB exactly). Can anybody help me reduce the size of the Docker image?
Here is the Dockerfile:
FROM python:3.7-slim-buster as base
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
COPY . /opt/program
WORKDIR /opt/program/
RUN chmod +x train
# Install dependencies
RUN apt-get update \
&& apt-get upgrade -y \
&& apt-get autoremove -y \
&& apt-get install -y \
gcc \
build-essential \
zlib1g-dev \
wget \
unzip \
cmake \
python3-dev \
gfortran \
libblas-dev \
liblapack-dev \
libatlas-base-dev \
&& apt-get clean
# Install Python packages
RUN pip install --upgrade pip \
&& pip install \
ipython[all] \
nose \
matplotlib \
pandas \
scipy \
sympy \
&& rm -fr /root/.cache
RUN pip install --install-option="--prefix=/install" -r requirements.txt
You are installing a lot of stuff into that image, so it will be fairly big anyway, but there are some things you can do about it.
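The minor one: remove /var/lib/apt/lists/* once you are done installing packages via apt. Do it at the end of the same RUN instruction that runs apt-get install, because files deleted in a later layer still take up space in the earlier one:
&& rm -rf /var/lib/apt/lists/*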
The major one: judging from the contents of the Dockerfile, I guess it is used to train a model. That requires training data, which can take a lot of space since you are copying everything into the image. The data does not need to be present in the image; it only needs to be available to the container started from the image.
Instead of copying everything into the image, copy only the files that are necessary to run the logic and load the data in some other way. One such way is to bind mount the data into the container. You could store the data in a separate folder, let's say ./data, and include this folder in your .dockerignore file (so that it is not copied over). Then, depending on how you are launching the container, you can specify the bind mount like this:
docker container run -v "$(pwd)/data":/<path-inside-image> ...
Replace <path-inside-image> with the path where the data should be located inside the container, but be careful not to mount onto a directory that already holds essential files, since those will be hidden by the mounted folder.
If a bind mount is not a viable solution for you, you will need to find another way to load the data into the container, for example pulling it from the internet or from network-attached storage once the container is running.
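As a rough sketch of the approach described above, the Dockerfile below copies only the files needed to run and leaves the dataset out of the image. The .dockerignore entries, the /opt/program/data mount target, and the assumption that requirements.txt is a valid pip requirements file are illustrative and not taken from the question; train is the executable the question's Dockerfile already refers to.

```dockerfile
# Assumed .dockerignore next to this Dockerfile (one pattern per line):
#   data
#   .git
FROM python:3.7-slim-buster
WORKDIR /opt/program

# Copy only what is needed to run; the dataset is not part of the image.
COPY requirements.txt train ./
RUN chmod +x train \
    && pip install --no-cache-dir -r requirements.txt

# At run time, bind mount the data (the target path is a placeholder):
#   docker container run -v "$(pwd)/data:/opt/program/data" <image>
```

With this layout the image size no longer depends on how large the dataset is, and the same image can be pointed at different datasets by changing the mount.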
Run rm -rf /var/lib/apt/lists/* in the same RUN instruction as apt-get install, such as:
RUN apt-get update && apt-get install -y \
ca-certificates \
netbase \
&& rm -rf /var/lib/apt/lists/*
and not in a separate RUN instruction, which leaves the package lists in the earlier layer:
RUN apt-get update && apt-get install -y \
ca-certificates \
netbase
RUN rm -rf /var/lib/apt/lists/*
Add --no-install-recommends to apt-get install:
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
netbase \
&& rm -rf /var/lib/apt/lists/*
--no-install-recommends means: do not install recommended but non-essential dependency packages.
Some software, such as gcc, is needed only while other software is being installed, so it can be removed once the installation has finished, e.g.:
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
&& pip install cython && apt-get remove -y gcc g++ \
&& apt-get autoremove -y \
&& rm -rf /var/lib/apt/lists/*
Use pip's --no-cache-dir option so that downloaded packages are not kept in the image, e.g.:
RUN pip install --no-cache-dir -r requirements.txt
I am not sure about this one, but in other people's Dockerfiles I have seen files downloaded, used, and deleted within a single RUN instruction instead of being copied into the image.
If you use TensorFlow or another AI application, you may have model data that is a few GB in size. It is better to download it when the container runs, or to fetch it via FTP, object storage, or some other way, than to keep it in the image. Just mount it or download it.
Just from my experience: if you use git to manage your code, the .git folder can be very big. The command COPY . /XXX will copy .git into the image. Find a way to filter out .git. This is what I use:
FROM alpine:3.12 as MID
COPY XXX /XXX/
COPY ... /XXX/
FROM image:youneed
COPY --from=MID /XXX/ /XXX/
RUN apt-get update && xxxxx
CMD ["python","app.py"]
Or use a .dockerignore file.
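The same staged-copy idea can also keep compilers out of the final image. The following is only a sketch under assumptions: it installs the Python packages into a virtual environment at /opt/venv (a path chosen here for illustration) in a builder stage, then copies that directory into a clean runtime stage. It also assumes requirements.txt is a valid pip requirements file and that build-essential is enough for any package that needs compiling.

```dockerfile
# Stage 1: build environment with compilers (discarded after the build)
FROM python:3.7-slim-buster AS builder
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install all Python packages into an isolated virtual environment.
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image without compilers or apt package lists
FROM python:3.7-slim-buster
ENV PYTHONUNBUFFERED=TRUE \
    PYTHONDONTWRITEBYTECODE=TRUE \
    PATH="/opt/venv/bin:/opt/program:${PATH}"

# Only the finished virtual environment crosses over from the builder.
COPY --from=builder /opt/venv /opt/venv
WORKDIR /opt/program
COPY train ./
RUN chmod +x train
```

Both stages use the same base image, so the interpreter path recorded inside the virtual environment stays valid after the copy. The Python packages themselves (TensorFlow in particular) will still account for most of the image size.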
# Are wget, cmake, and some of the other packages really necessary?
COPY . /opt/program
WORKDIR /opt/program/
# Install dependencies
RUN chmod +x train && apt-get update \
&& apt-get upgrade -y \
&& apt-get autoremove -y \
&& apt-get install -y \
gcc \
build-essential \
zlib1g-dev \
wget \
unzip \
cmake \
python3-dev \
gfortran \
libblas-dev \
liblapack-dev \
libatlas-base-dev \
&& apt-get clean && pip install --upgrade pip \
&& pip install --no-cache-dir \
ipython[all] \
nose \
matplotlib \
pandas \
scipy \
sympy \
&& pip install --no-cache-dir --install-option="--prefix=/install" -r requirements.txt \
# Just a trial: experiment to find out which software can be removed.
&& apt-get remove -y gcc unzip cmake \
&& rm -rf /var/lib/apt/lists/* \
&& rm -fr /root/.cache
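Of course, this way you may get a somewhat smaller image, but because everything sits in a single RUN instruction, any change to it reruns the whole step and the build cannot reuse Docker's layer cache. So while you are still finding out which software can be removed, split it into two or three RUN instructions to make better use of the cache.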
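One possible way to split the work while experimenting is sketched below. The package names in the last step are only examples of removal candidates, and requirements.txt is assumed to be a valid pip requirements file. Docker reuses a cached layer as long as the instruction and everything before it are unchanged, so editing only the last RUN does not repeat the slow installs.

```dockerfile
FROM python:3.7-slim-buster
WORKDIR /opt/program

# Step 1: system packages; cached until this instruction is edited.
RUN apt-get update \
    && apt-get install -y --no-install-recommends gcc g++ \
    && rm -rf /var/lib/apt/lists/*

# Step 2: Python packages; rebuilt only when requirements.txt changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Step 3: removal candidates under test; editing this line reuses steps 1 and 2.
RUN apt-get remove -y gcc g++ \
    && apt-get autoremove -y
```

Removing packages in a separate RUN does not shrink the image, because the files still exist in the earlier layer. Once you know what can go, merge the removal back into the RUN instruction that installed the packages.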
Hope this helps.