
How to run a python script within a docker container through nextflow seamlessly, without any path or env related issues?

I am trying to run a python script using nextflow and docker. I am using a Dockerfile (shown below) to create a docker image. The Nextflow script simply launches a python script. The issue is that when I run the same python command from within the docker container (in interactive mode) it works fine, but when I launch it using nextflow with the docker container it throws an error.

Dockerfile:

#!/usr/local/bin/docker
# -*- version: 20.10.2 -*-

############################################
## MULTI-STAGE CONTAINER CONFIGURATION ##
FROM python:3.6.2
RUN apt-get update && apt-get install -y \
    apt-transport-https \
    software-properties-common \
    unzip \
    curl
RUN wget -O- https://apt.corretto.aws/corretto.key | apt-key add - && \
    add-apt-repository 'deb https://apt.corretto.aws stable main' && \
    apt-get update && \
    apt-get install -y java-1.8.0-amazon-corretto-jdk


############################################
## PHEKNOWLATOR (PKT_KG) PROJECT SETTINGS ##
# create needed project directories
WORKDIR /PKT
RUN mkdir -p /PKT
RUN mkdir -p /PKT/resources
RUN mkdir -p /PKT/resources/construction_approach
RUN mkdir -p /PKT/resources/edge_data
RUN mkdir -p /PKT/resources/knowledge_graphs
RUN mkdir -p /PKT/resources/node_data
RUN mkdir -p /PKT/resources/ontologies
RUN mkdir -p /PKT/resources/processed_data
RUN mkdir -p /PKT/resources/relations_data

# copy scripts/files needed to run pkt_kg
COPY pkt_kg /PKT/pkt_kg
COPY Main.py /PKT
COPY setup.py /PKT
COPY README.rst /PKT
COPY resources /PKT/resources

# download and copy needed data
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/edge_source_list.txt && mv edge_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ontology_source_list.txt && mv ontology_source_list.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/resource_info.txt && mv resource_info.txt resources/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/subclass_construction_map.pkl && mv subclass_construction_map.pkl resources/construction_approach/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PheKnowLator_MergedOntologies.owl && mv PheKnowLator_MergedOntologies.owl resources/knowledge_graphs/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/node_metadata_dict.pkl && mv node_metadata_dict.pkl resources/node_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/DISEASE_MONDO_MAP.txt && mv DISEASE_MONDO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENSEMBL_GENE_ENTREZ_GENE_MAP.txt && mv ENSEMBL_GENE_ENTREZ_GENE_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt && mv ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt && mv GENE_SYMBOL_ENSEMBL_TRANSCRIPT_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/HPA_GTEx_TISSUE_CELL_MAP.txt && mv HPA_GTEx_TISSUE_CELL_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/MESH_CHEBI_MAP.txt && mv MESH_CHEBI_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/PHENOTYPE_HPO_MAP.txt && mv PHENOTYPE_HPO_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/STRING_PRO_ONTOLOGY_MAP.txt && mv STRING_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt && mv UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt resources/processed_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/INVERSE_RELATIONS.txt && mv INVERSE_RELATIONS.txt resources/relations_data/
RUN curl -O https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/RELATIONS_LABELS.txt && mv RELATIONS_LABELS.txt resources/relations_data/

# install needed python libraries
RUN pip install --upgrade pip setuptools
WORKDIR /PKT
RUN pip install .


############################################
## GLOBAL ENVIRONMENT SETTINGS ##
# copy files needed to run docker container
COPY entrypoint.sh /PKT

# update permissions for all files
RUN chmod -R 755 /PKT

# set OWlTools memory (set to a high value, system will only use available memory)
ENV OWLTOOLS_MEMORY=500g
RUN echo $OWLTOOLS_MEMORY

# set python environment encoding (use ENV, not RUN export: exported shell
# variables do not persist beyond the RUN layer they are set in)
ENV PYTHONIOENCODING=utf-8

Name of the docker image: pkt:2.0.0

Nextflow script:

process run_PKTBaseRun{

echo true

container 'pkt:2.0.0'
publishDir "${params.outDir}", mode: 'copy'

output:
file '*' into output_ch

script:
"""
which python
$PWD
pwd
python /PKT/Main.py --onts /PKT/resources/ontology_source_list.txt \
            --edg /PKT/resources/edge_source_list.txt \
            --res /PKT/resources/resource_info.txt \
            --out /PKT/resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
"""


}

Now when I execute:

nextflow run main.nf

This gives an error related to the glob.glob module, because it does not list the files as it should inside the docker container.

However, when I simply run the same python command inside the docker container, it runs seamlessly:

> docker run -it pkt:2.0.0 /bin/bash

/PKT> python Main.py --onts resources/ontology_source_list.txt \
            --edg resources/edge_source_list.txt \
            --res resources/resource_info.txt \
            --out resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no

It is only when I combine nextflow with docker that this code throws errors. I have verified that the python being used is the one inside the container.

Questions:

  1. Any ideas/thoughts to make it work?

Interestingly:

    • the output of `which python` → the python within the container
    • BUT the output of `$PWD` → the directory from which nextflow was launched
    • the output of `pwd` → the nextflow work directory

  2. When we add a container to a nextflow process, aren't the commands inside the process (run_PKTBaseRun) run from the container's workdir? Shouldn't the value of `pwd` therefore be the container workdir rather than the nextflow workdir?

All the required files have been added to the docker image.

  3. Is there a way to ensure that the commands within the script section of the nextflow process are run from the docker root/workdir?

The idea behind this nextflow + docker setup is to eventually run it on AWS Batch using awscli. But before running it on AWS Batch, I want to make sure it runs fine on the local server.

Looking forward to your suggestions and ideas. Thank you.

Try escaping it as `\$PWD`, which will give you the nextflow process workdir that is mounted in docker. I'm curious if you have solved it some other way?

Try running this in the nextflow process script:

export pdir=\$PWD
echo \$pdir
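To see the difference between the two forms, here is a minimal sketch (reusing the pkt:2.0.0 container from the question; the process name is made up):

```nextflow
process debug_paths {

    container 'pkt:2.0.0'
    echo true

    script:
    """
    # Unescaped variables are interpolated by Groovy before the script runs:
    echo $PWD      # the directory where `nextflow run` was launched
    # Escaped variables are left for bash inside the container:
    echo \$PWD     # the task work directory, bind-mounted into the container
    """
}
```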

Bit of an old question, but for future googlers - Nextflow does quite a bit of behind-the-scenes work when running Docker, including mounting files into the container and setting the working directory. This is needed so that commands can generally run seamlessly from within a process with the expected inputs. However, it means that some Dockerfile configurations, such as WORKDIR, will be overwritten.
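Concretely, the wrapper that Nextflow generates (saved as `.command.run` in each task directory) launches the container with the task work directory bind-mounted and set as the working directory. A rough sketch of the shape of that command (the pipeline path and hash directory are hypothetical, and the real wrapper adds more options):

```shell
# Hypothetical task work directory, following Nextflow's xx/hash layout
task_dir=/path/to/pipeline/work/a1/b2c3d4

# Approximation of the generated docker invocation: the work tree is
# bind-mounted, and -w overrides the image's WORKDIR (/PKT in this case),
# which is why `pwd` inside the task reports the Nextflow work directory
echo docker run -i \
    -v /path/to/pipeline/work:/path/to/pipeline/work \
    -w "$task_dir" \
    pkt:2.0.0 /bin/bash -ue "$task_dir/.command.sh"
```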

Looking at the examples above, I have a couple of suggestions:

  1. It's usually better to stage external data into the Nextflow process rather than baking it into the container (just specify a `path` input with the URL and Nextflow will know to download it for you).
  2. Try not to rely on a specific working directory within the container; instead, go for packaged installs that add command-line tools to the `PATH`.
    • It's a bit difficult to know if you're doing this already - there's a `pip install .`, but the Nextflow script runs an absolute path directly.
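For point 1, a sketch of what staging one of the files could look like, using a URL from the Dockerfile above (written in DSL1 to match the question's script; the channel name is made up):

```nextflow
onts_ch = Channel.fromPath('https://storage.googleapis.com/pheknowlator/current_build/data/processed_data/ontology_source_list.txt')

process run_PKTBaseRun {

    container 'pkt:2.0.0'

    input:
    file onts from onts_ch

    script:
    """
    # Nextflow downloads the file and stages it into the task work directory,
    # so the script can refer to it by name instead of a baked-in path
    python /PKT/Main.py --onts $onts \
        --edg /PKT/resources/edge_source_list.txt \
        --res /PKT/resources/resource_info.txt \
        --out /PKT/resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
    """
}
```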

One of the benefits of keeping the Dockerfile as slim as possible is that it makes your pipeline more portable. If your installed tool is kept simple, other people are more likely to be able to run it on systems that don't have Docker installed (using Singularity, Conda, etc. instead).

If you really need to work within a specific directory in the container, then adding a `cd` command to the Nextflow script should work. But bear in mind that your input files will be located within the work directory path inside the container, which will be variable.
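For example, a minimal sketch that runs the question's interactive command from `/PKT` inside the process script:

```nextflow
script:
"""
cd /PKT
python Main.py --onts resources/ontology_source_list.txt \
    --edg resources/edge_source_list.txt \
    --res resources/resource_info.txt \
    --out resources/knowledge_graphs --app subclass --kg full --nde yes --rel yes --owl no
"""
```

Note that outputs declared with `file '*'` are still collected from the task work directory, not from `/PKT`, so anything you want published has to be written (or copied) back there.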

