Including another file in Dataflow Python flex template, ImportError

Is there an example of a Python Dataflow Flex Template with more than one file, where the script imports other files in the same folder?

My project structure is like this:

├── pipeline
│   ├── __init__.py
│   ├── main.py
│   ├── setup.py
│   ├── custom.py

I'm trying to import custom.py inside main.py for a Dataflow flex template.

I receive the following error in the pipeline execution:

ModuleNotFoundError: No module named 'custom'

The pipeline works fine if I include all of the code in a single file and don't make any imports.
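A small diagnostic can help narrow down where the import fails. The sketch below is only an illustration and not part of the original pipeline: `describe_module` is a hypothetical helper name, and it reports whether a module name resolves on the current `sys.path`.

```python
import importlib.util
import sys


def describe_module(name):
    """Return a short report on whether `name` can be imported from here."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # Show the first few search paths to make a missing directory obvious.
        return f"{name}: NOT importable; sys.path starts with {sys.path[:3]!r}"
    return f"{name}: found at {spec.origin}"


if __name__ == "__main__":
    # "custom" is the module from the question; "json" is a control that
    # resolves because it ships with the standard library.
    for module_name in ("custom", "json"):
        print(describe_module(module_name))
```

Calling it at the top of main.py, before the failing import, would show whether the problem is the search path in the launcher container or on the workers.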

Example Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template/pipeline
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY pipeline /dataflow/template/pipeline

COPY spec/python_command_spec.json /dataflow/template/

ENV DATAFLOW_PYTHON_COMMAND_SPEC /dataflow/template/python_command_spec.json

RUN pip install avro-python3 pyarrow==0.11.1 apache-beam[gcp]==2.24.0

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

Python spec file:

{
    "pyFile":"/dataflow/template/pipeline/main.py"
}
  

I am deploying the template with the following command:

gcloud builds submit --project=${PROJECT} --tag ${TARGET_GCR_IMAGE} .

I actually solved this by passing an additional parameter, setup_file, to the template execution. The setup_file parameter also needs to be added to the template metadata.

--parameters setup_file="/dataflow/template/pipeline/setup.py"
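As a rough sketch of the metadata change, the snippet below adds a `setup_file` entry to a metadata dictionary and prints it as JSON. The helper name, the template name, the description, and the label and help text are all placeholders I made up for illustration; only the parameter name `setup_file` comes from the answer.

```python
import json


def add_setup_file_parameter(metadata):
    """Return a copy of `metadata` that declares an optional setup_file parameter."""
    params = list(metadata.get("parameters", []))
    if not any(p.get("name") == "setup_file" for p in params):
        params.append({
            "name": "setup_file",
            "label": "Path to setup.py inside the container",
            "helpText": "For example /dataflow/template/pipeline/setup.py",
            "isOptional": True,
        })
    return {**metadata, "parameters": params}


# Hypothetical starting metadata; name and description are placeholders.
base = {
    "name": "my-pipeline",
    "description": "Example flex template",
    "parameters": [],
}
print(json.dumps(add_setup_file_parameter(base), indent=2))
```

The printed JSON could be saved as the metadata file that is supplied when the template is built.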

Apparently the instruction ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py" in the Dockerfile has no effect and doesn't actually pick up the setup file.

My setup file looked like this:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[
        'apache-beam[gcp]==2.24.0'
    ],
 )

After some tests I found that, for reasons I could not determine, Python files located directly in the working directory (WORKDIR) cannot be referenced with an import. It does work if you create a subfolder and move the Python dependencies into it. I tested this and it worked. For example, in your use case you can have the following structure:

├── pipeline
│   ├── main.py
│   ├── setup.py
│   ├── mypackage
│   │   ├── __init__.py
│   │   ├── custom.py

You will then be able to reference it with import mypackage.custom. The Dockerfile should copy custom.py into the proper directory:

RUN mkdir -p ${WORKDIR}/mypackage
RUN touch ${WORKDIR}/mypackage/__init__.py
COPY custom.py ${WORKDIR}/mypackage

The dependency will then be added to the Python installation directory:

$ docker exec -it <container> /bin/bash
# find / -name custom.py
/usr/local/lib/python3.7/site-packages/mypackage/custom.py
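Before building the image, the package layout from this answer can be checked locally. The following is a minimal sketch that recreates the `mypackage` structure in a temporary directory and imports `mypackage.custom` from it. The `greet` function and both helper names are hypothetical stand-ins, since the real contents of custom.py are not shown.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_layout(root):
    """Create root/mypackage/{__init__.py,custom.py} as in the tree above."""
    package_dir = Path(root) / "mypackage"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")
    # Placeholder body: the real custom.py would hold the pipeline helpers.
    (package_dir / "custom.py").write_text(
        "def greet():\n    return 'hello from custom'\n"
    )
    return package_dir


def import_from(root, dotted_name):
    """Import `dotted_name` with `root` temporarily placed first on sys.path."""
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        return importlib.import_module(dotted_name)
    finally:
        sys.path.remove(str(root))


with tempfile.TemporaryDirectory() as workdir:
    build_layout(workdir)
    custom = import_from(workdir, "mypackage.custom")
    print(custom.greet())
```

This only confirms that the directory is a valid importable package. It does not reproduce how the Dataflow workers install it.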

@pavan-kumar-kattamuri asked me to post my solution, so here it is.

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:flex_templates_base_image_release_20210120_RC00

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY requirements.txt .


# Read https://stackoverflow.com/questions/65766066/can-i-make-flex-template-jobs-take-less-than-10-minutes-before-they-start-to-pro#comment116304237_65766066
# to understand why apache-beam is not being installed from requirements.txt
RUN pip install --no-cache-dir -U apache-beam==2.26.0
RUN pip install --no-cache-dir -U -r ./requirements.txt

COPY mymodule.py setup.py ./
COPY protoc_gen protoc_gen/

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/mymodule.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

And here is my setup.py:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[],
    name="my df job modules",
)

OK, with Apache Beam 2.27 it seems we need to go back to the original approach of passing the setup_file parameter. A shame.

In my case I didn't need to pass setup_file in the command that triggers the flex template. Here is my Dockerfile:

FROM gcr.io/dataflow-templates-base/python38-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY . .

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

# Install apache-beam and other dependencies to launch the pipeline
RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt

This is the command:

gcloud dataflow flex-template run "job_ft" --template-file-gcs-location "$TEMPLATE_PATH" --parameters paramA="valA" --region "europe-west1"
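For readers who do need the setup_file parameter described in the earlier answer, the command can be assembled programmatically. The sketch below only builds and prints the argument list and does not invoke gcloud. The helper name and the bucket path are placeholders, and it assumes parameter values contain no commas, since the pairs are joined with commas into a single `--parameters` value.

```python
import shlex


def flex_template_run_args(job_name, template_path, region, parameters):
    """Build the argv list for `gcloud dataflow flex-template run`."""
    # Assumes no key or value contains a comma.
    joined = ",".join(f"{key}={value}" for key, value in parameters.items())
    return [
        "gcloud", "dataflow", "flex-template", "run", job_name,
        "--template-file-gcs-location", template_path,
        "--region", region,
        "--parameters", joined,
    ]


args = flex_template_run_args(
    job_name="job_ft",
    template_path="gs://my-bucket/templates/job_ft.json",  # placeholder path
    region="europe-west1",
    parameters={
        "paramA": "valA",
        "setup_file": "/dataflow/template/setup.py",
    },
)
# Print a shell-quoted version for inspection or copy and paste.
print(shlex.join(args))
```

The resulting list could be passed to `subprocess.run(args, check=True)` on a machine where gcloud is installed and authenticated.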

A working example of a Python Dataflow Flex Template with more than one file, where the script imports other files in the same folder, can be found here: https://github.com/toransahu/apache-beam-eg/tree/main/python/using_flex_template_adv1
