
Does Apache Beam need internet to run GCP Dataflow jobs

I am trying to deploy a Dataflow job on a GCP VM that will have access to GCP resources but will not have internet access. When I try to run the job I get a connection timeout error, which would make sense if I were trying to connect to the internet. The code breaks because an HTTP connection is being attempted on behalf of apache-beam.

Python setup: Before cutting off the VM, I installed all necessary packages using pip and a requirements.txt. This seemed to work, because other parts of the code work fine.

The following is the error message I receive when I run the code:

Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) 
after connection broken by 'ConnectTimeoutError(
<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at foo>, 
'Connection to pypi.org timed out. (connect timeout=15)')': /simple/apache-beam/

Could not find a version that satisfies the requirement apache-beam==2.9.0 (from versions: )

No matching distribution found for apache-beam==2.9.0

If you are running a Python job, does it need to connect to PyPI? Is there a hack around this?

When we use Google Cloud Composer with Private IP enabled, we don't have access to the internet.

To solve this:

  • Create a GKE cluster and create a new node pool named "default-pool" (use the same name); see the sketch after this list.
  • In network tags: add "private".
  • Under security: check "Allow access to all Cloud APIs".
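
For reference, a rough sketch of that node pool setup with the google-cloud-container Python client; this client-based version is my own illustration, not part of the original answer, and the project, location, cluster name and pool size are placeholders (the console/gcloud flow from the list above works just as well):

from google.cloud import container_v1

def create_private_default_pool(project, location, cluster):
    """Recreates "default-pool" with the "private" tag and full Cloud API access."""
    client = container_v1.ClusterManagerClient()
    node_pool = container_v1.NodePool(
        name="default-pool",  # keep this exact name, as the answer suggests
        initial_node_count=3,  # placeholder size
        config=container_v1.NodeConfig(
            tags=["private"],  # the network tag from the list above
            # "Allow access to all Cloud APIs" corresponds to this OAuth scope:
            oauth_scopes=["https://www.googleapis.com/auth/cloud-platform"],
        ),
    )
    client.create_node_pool(
        parent=f"projects/{project}/locations/{location}/clusters/{cluster}",
        node_pool=node_pool,
    )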

If you run a DataflowPythonOperator in a private Cloud Composer environment, the job needs to access the internet to download a set of packages from the image project projects/dataflow-service-producer-prod . But within the private cluster, VMs and GKE nodes don't have access to the internet.

To solve this problem, you need to create a Cloud NAT and a router: https://cloud.google.com/nat/docs/gke-example#step_6_create_a_nat_configuration_using

This will allow your instances to send packets to the internet and receive inbound traffic.
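
As an illustration (again my own sketch, not from the original answer), the same router-plus-NAT configuration can be created with the google-cloud-compute client; the resource names are placeholders, and the gcloud steps in the linked guide are equivalent:

from google.cloud import compute_v1

def create_router_with_nat(project, region, network):
    """Creates a Cloud Router with a NAT config so private instances get egress."""
    nat = compute_v1.RouterNat(
        name="my-nat-config",  # placeholder
        nat_ip_allocate_option="AUTO_ONLY",
        source_subnetwork_ip_ranges_to_nat="ALL_SUBNETWORKS_ALL_IP_RANGES",
    )
    router = compute_v1.Router(
        name="my-nat-router",  # placeholder
        network=f"projects/{project}/global/networks/{network}",
        nats=[nat],
    )
    operation = compute_v1.RoutersClient().insert(
        project=project, region=region, router_resource=router
    )
    operation.result()  # wait for the router (and its NAT) to be created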

TL;DR: Copy the Apache Beam SDK archive into an accessible path and provide that path via the SetupOptions sdk_location option in your Dataflow pipeline.

I was also struggling with this setup for a long time. Finally I found a solution which does not need internet access during execution.

There are probably multiple ways to do that, but the following two are rather simple.

As a precondition you'll need to create the apache-beam-sdk source archive as follows:

  1. Clone the Apache Beam GitHub repository

  2. Switch to the required tag, e.g. v2.28.0

  3. cd to beam/sdks/python

  4. Create a tar.gz source archive of your required beam SDK version:

     python setup.py sdist

  5. Now you should have the source archive apache-beam-2.28.0.tar.gz in the path beam/sdks/python/dist/
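
(Alternatively, on a machine that still has PyPI access, pip download apache-beam==2.28.0 --no-deps --no-binary :all: should fetch the same source archive without a full checkout, though I have not verified this shortcut.)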

Option 1 - Use Flex Templates and copy the Apache Beam SDK in the Dockerfile
Documentation: Google Dataflow documentation

  1. Create a Dockerfile --> you have to include this COPY utils/apache-beam-2.28.0.tar.gz /tmp , because this is going to be the path you can set in your SetupOptions.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}

WORKDIR ${WORKDIR}

# Due to a change in the Apache Beam base image in version 2.24, you must
# install libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891

# update used packages
RUN apt-get update && apt-get install -y \
    libffi-dev \
 && rm -rf /var/lib/apt/lists/*


COPY setup.py .
COPY main.py .

COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

RUN python -m pip install --user --upgrade pip setuptools wheel
  2. Set sdk_location to the path you've copied the apache-beam SDK archive to (a fuller pipeline sketch follows after these steps):
    options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
  3. Build the Docker image with Cloud Build
    gcloud builds submit --tag $TEMPLATE_IMAGE .
  4. Create a Flex Template
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
 --image=gcr.io/your-project-id/image-name:tag \
 --sdk-language=PYTHON \
 --metadata-file=metadata.json
  5. Run the generated Flex Template in your subnetwork (if required)
gcloud dataflow flex-template run "your-dataflow-job-name" \
--template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
--parameters staging_location="gs://your-bucket-path/staging/" \
--parameters temp_location="gs://your-bucket-path/temp/" \
--service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
--region="yourRegion" \
--max-workers=6 \
--subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
--disable-public-ips
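
For context, here is a minimal sketch (my own, assumed rather than taken from the original answer) of a main.py that wires the sdk_location from step 2 into the pipeline options; the project, region and bucket values are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project-id",  # placeholder
    region="your-region",  # placeholder
    temp_location="gs://your-bucket-path/temp/",  # placeholder
)
# Install Beam on the workers from the archive copied in the Dockerfile
# instead of downloading it from PyPI.
options.view_as(SetupOptions).sdk_location = "/tmp/apache-beam-2.28.0.tar.gz"

with beam.Pipeline(options=options) as p:
    p | beam.Create(["no", "internet", "needed"]) | beam.Map(print)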

Option 2 - Copy sdk_location from GCS
According to the Beam documentation you should even be able to provide a GCS / gs:// path directly for the sdk_location option, but it didn't work for me. The following should work, though:

  1. Upload the previously generated archive to a bucket which you're able to access from the Dataflow job you'd like to execute, e.g. to something like gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz
  2. Download the apache-beam SDK archive in your source code to e.g. /tmp/apache-beam-2.28.0.tar.gz :
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)  # bare bucket name, no gs:// prefix

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)  # object path within the bucket
    blob.download_to_filename(destination_file_name)
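
# Example call using the bucket and object from step 1 (placeholder names):
download_blob(
    bucket_name="yourbucketname",
    source_blob_name="beam_sdks/apache-beam-2.28.0.tar.gz",
    destination_file_name="/tmp/apache-beam-2.28.0.tar.gz",
)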

  3. Now you can set the sdk_location to the path you've downloaded the SDK archive to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
  4. Now your pipeline should be able to run without internet breakout.
