How to run a Python script within a Docker container through Nextflow seamlessly, without any path- or env-related issues?
I am trying to run a Python script using Nextflow and Docker. I am using a Dockerfile (shown below) to create a Docker image. The Nextflow script simply launches the Python script. The issue is that when I run the same python command from within the Docker container (in interactive mode), it works fine. But when I launch it through Nextflow with the Docker container, it throws an error.
Dockerfile:
#!/usr/local/bin/docker
# -*- version: 20.10.2 -*-
############################################
## MULTI-STAGE CONTAINER CONFIGURATION ##
FROM python:3.6.2
RUN apt-get update && apt-get install -y \
apt-transport-https \
software-properties-common \
unzip \
curl
RUN wget -O- https://apt.corretto.aws/corretto.key | apt-key add - && \
add-apt-repository 'deb https://apt.corretto.aws stable main' && \
apt-get update && \
apt-get install -y java-1.8.0-amazon-corretto-jdk
############################################
## PHEKNOWLATOR (PKT_KG) PROJECT SETTINGS ##
# create needed project directories
WORKDIR /PKT
RUN mkdir -p /PKT
RUN mkdir -p /PKT/resources
RUN mkdir -p /PKT/resources/construction_approach
RUN mkdir -p /PKT/resources/edge_data
RUN mkdir -p /PKT/resources/knowledge_graphs
RUN mkdir -p /PKT/resources/node_data
RUN mkdir -p /PKT/resources/ontologies
RUN mkdir -p /PKT/resources/processed_data
RUN mkdir -p /PKT/resources/relations_data
# copy scripts/files needed to run pkt_kg
COPY pkt_kg /PKT/pkt_kg
COPY Main.py /PKT
COPY setup.py /PKT
COPY README.rst /PKT
COPY resources /PKT/resources
# download and copy needed data
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/edge_source_list.txt && mv edge_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ontology_source_list.txt && mv ontology_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/resource_info.txt && mv resource_info.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/subclass_construction_map.pkl && mv subclass_construction_map.pkl resources/construction_approach/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PheKnowLator_MergedOntologies.owl && mv PheKnowLator_MergedOntologies.owl resources/knowledge_graphs/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/node_metadata_dict.pkl && mv node_metadata_dict.pkl resources/node_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt && mv DISEASE_MONDO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt && mv ENSEMBL_GENE_ENTREZ_GENE_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt && mv ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt && mv GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt && mv HPA_GTEx_TISSUE_CELL_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt && mv MESH_CHEBI_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt && mv PHENOTYPE_HPO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/STRING_PRO_ONTOLOGY_MAP.txt && mv STRING_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt && mv UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/INVERSE_RELATIONS.txt && mv INVERSE_RELATIONS.txt resources/relations_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/RELATIONS_LABELS.txt && mv RELATIONS_LABELS.txt resources/relations_data/
# install needed python libraries
RUN pip install --upgrade pip setuptools
WORKDIR /PKT
RUN pip install .
############################################
## GLOBAL ENVIRONMENT SETTINGS ##
# copy files needed to run docker container
COPY entrypoint.sh /PKT
# update permissions for all files
RUN chmod -R 755 /PKT
# set OWlTools memory (set to a high value, system will only use available memory)
ENV OWLTOOLS_MEMORY=500g
RUN echo $OWLTOOLS_MEMORY
# set python environment encoding (ENV persists into the running container; RUN export does not)
ENV PYTHONIOENCODING=utf-8
Name of the Docker image: pkt:2.0.0
Nextflow script:
process run_PKTBaseRun{
echo True
container 'pkt:2.0.0'
publishDir "${params.outDir}", mode: 'copy'
output:
file '*' into output_ch
script:
"""
which python
$PWD
pwd
python /PKT/Main.py --onts /PKT/resources/ontology_source_list.txt \
--edg /PKT/resources/edge_source_list.txt \
--res /PKT/resources/resource_info.txt \
--out /PKT/resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
"""
}
Now when I execute:
nextflow run main.nf
This gives an error related to glob.glob, because it does not list the files that are inside the Docker container.
However, when I simply run the Python command above inside the Docker container, it runs seamlessly.
> docker run -it pkt:2.0.0 /bin/bash
/PKT> python Main.py --onts resources/ontology_source_list.txt \
--edg resources/edge_source_list.txt \
--res resources/resource_info.txt \
--out resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
Only when I combine Nextflow with Docker does this code throw errors. I have ensured that the python being used is the one inside the container.
Questions:
Interestingly,
the output of which python --> python within the container
BUT,
the output of $PWD --> directory from where nextflow is launched
the output of pwd --> work directory of nextflow
All the required files have been added to the Docker image.
The idea behind using Nextflow and Docker is to eventually run this on AWS Batch using awscli. But before running it on AWS Batch, I want to ensure that it runs fine on the local server.
Looking forward to your suggestions and ideas. Thank you.
Try escaping the variable as \$PWD, which will give you the Nextflow process work directory that is mounted in Docker. I'm curious whether you have solved it some other way?
Try running this in the Nextflow process script:
export pdir=\$PWD
echo \$pdir
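The effect of the backslash can be reproduced in plain shell. This is a rough analogy rather than Nextflow itself: an outer double-quoted string is expanded before the inner shell ever runs, much as Nextflow interpolates $PWD in a """ script block before handing the script to Bash. The two temporary directories are illustrative stand-ins for the launch directory and the task work directory.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway directories standing in for the launch dir and the task work dir
launch_dir=$(mktemp -d)
work_dir=$(mktemp -d)

cd "$launch_dir"

# Unescaped: $PWD is expanded here, by the outer shell, before the inner one starts
unescaped=$(bash -c "cd '$work_dir' && echo $PWD")

# Escaped: \$PWD reaches the inner shell literally and is expanded after the cd
escaped=$(bash -c "cd '$work_dir' && echo \$PWD")

echo "unescaped: $unescaped"   # the launch directory
echo "escaped:   $escaped"     # the work directory
```

The unescaped form reports where the outer shell was started, which matches the behaviour described in the question; the escaped form reports the directory the inner script is actually running in.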
Bit of an old question, but for future Googlers: Nextflow does quite a bit of behind-the-scenes work when running Docker, including mounting files into the container and setting the working directory. This is needed so that commands can generally run seamlessly from within a process with the expected inputs. However, it means that some Dockerfile configurations such as WORKDIR will be overwritten.
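A small shell sketch of why an overridden working directory matters. It assumes (the question does not confirm this) that the failing glob.glob calls use paths relative to the current directory. The temporary directories are stand-ins for /PKT and for the Nextflow work directory, and count_matches is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -eu

# Stand-ins: image_root plays the role of /PKT, task_dir the Nextflow work directory
image_root=$(mktemp -d)
task_dir=$(mktemp -d)
mkdir -p "$image_root/resources"
touch "$image_root/resources/resource_info.txt"

# Count the files matching a glob pattern when evaluated from a given directory.
# $1 = directory to run from, $2 = glob pattern (left unquoted so it expands)
count_matches() {
    (cd "$1" && ls -d $2 2>/dev/null | wc -l | tr -d ' ')
}

from_root=$(count_matches "$image_root" "resources/*.txt")           # relative, cwd = /PKT stand-in
from_task=$(count_matches "$task_dir" "resources/*.txt")             # relative, cwd = work dir
absolute=$(count_matches "$task_dir" "$image_root/resources/*.txt")  # absolute, cwd irrelevant

echo "relative from image root: $from_root"
echo "relative from task dir:   $from_task"
echo "absolute from task dir:   $absolute"
```

The relative pattern only finds the file when the current directory is the image root, while the absolute pattern finds it from anywhere. If the tool builds its patterns relative to the current directory, passing absolute paths on the command line is not enough on its own.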
Looking at the examples above I would have a couple of suggestions:
- [...] path with the URL and Nextflow will know to download it for you).
- [...] packaged installation [...] PATH.
- [...] pip install . but the Nextflow script directly runs an absolute path.
One of the benefits of keeping the Dockerfile as slim as possible is that it makes your pipeline more portable. If your installed tool is super simple then other people are more likely to be able to run it on systems that don't have Docker installed (using Singularity, Conda etc. instead).
If you really, really need to work within a specific directory in the container, then adding a cd command to the Nextflow script should work. But bear in mind that your input files will be located within the work directory path inside the container, which will be variable.
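A sketch of what the body of such a script could look like, written as plain shell so it can be tried outside Nextflow. Inside a Nextflow """ block each $ would need a backslash, as with \$PWD in the earlier answer. The directories and the produce_output function are hypothetical stand-ins for /PKT, the task work directory and the real python Main.py call. Copying results back is an assumption based on the output: file '*' declaration in the question, which collects files from the work directory.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-ins for /PKT and for the Nextflow task work directory
tool_dir=$(mktemp -d)
task_dir=$(mktemp -d)
mkdir -p "$tool_dir/resources/knowledge_graphs"

# Stand-in for the real tool: writes its result relative to the current directory
produce_output() {
    echo "graph" > resources/knowledge_graphs/result.txt
}

cd "$task_dir"     # Nextflow starts the script in the task work directory
start_dir=$PWD     # remember it before moving away

cd "$tool_dir"     # run where the tool expects its relative paths
produce_output
cd "$start_dir"    # return to the work directory

# Copy results back so the process can find its declared outputs
cp -r "$tool_dir/resources/knowledge_graphs" "$start_dir/"

ls "$start_dir/knowledge_graphs"
```

Remembering the starting directory first matters because the work directory path differs for every task, so it cannot be hard-coded.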