
GCP Dataflow Kafka (as Azure Event Hub) -> BigQuery

TL;DR

I have a Kafka-enabled Azure Event Hub that I'm trying to connect to from Google Cloud's Dataflow service to stream the data into Google BigQuery. I can successfully use the Kafka CLI to talk to the Azure Event Hub. However, with GCP, I get timeout errors in the GCP Dataflow job window after 5 minutes.

Azure EH w/ Kafka enabled -> GCP Dataflow -> GCP BigQuery table

Details

To set up the Kafka-enabled Event Hub, I followed the details on this GitHub page. It has the developer add a jaas.conf and a client_common.properties. The jaas.conf includes a reference to the login module along with a username/password. The username for Event Hubs with Kafka is $ConnectionString. The password is the connection string copied from the CLI. The client_common.properties contains two flags: security.protocol=SASL_SSL and sasl.mechanism=PLAIN. By configuring these files, I'm able to send and receive data using the Kafka CLI tools and the Azure Event Hub. I can see the data streaming from the producer to the consumer through the Azure Event Hub.
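For reference, a minimal sketch of what those two files can contain, based only on the settings just described (the namespace and connection string below are placeholders, and the heredocs are quoted so that $ConnectionString stays literal):

# jaas.conf - SASL PLAIN login module with the Event Hub connection string as the password
cat > jaas.conf <<'EOF'
KafkaClient {
    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="$ConnectionString"
    password="Endpoint=sb://test-eh-namespace.servicebus.windows.net/;SharedAccessKeyName=<key_name>;SharedAccessKey=<key>";
};
EOF

# client_common.properties - the two flags mentioned above
cat > client_common.properties <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
EOF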

export KAFKA_OPTS="-Djava.security.auth.login.config=jaas.conf"

(echo -n "1|"; cat message.json | jq . -c) | kafka-console-producer.sh --topic test-event-hub --broker-list test-eh-namespace.servicebus.windows.net:9093 --producer.config client_common.properties --property "parse.key=true" --property "key.separator=|"

kafka-console-consumer.sh --topic test-event-hub --bootstrap-server test-eh-namespace.servicebus.windows.net:9093 --consumer.config client_common.properties --property "print.key=true"
# prints: 1 { "transaction_time": "2020-07-20 15:14:54", "first_name": "Joe", "last_name": "Smith" }

I modified Google's Dataflow template for Kafka -> BigQuery. A configuration map was already specified for resetting the offset. I added additional configuration to match the Azure Event Hubs for Kafka tutorial. While not best practice, I added the connection string to the password field for testing. When I upload it to the GCP Dataflow engine and run the job, I get timeout errors every 5 minutes in the log and nothing ends up in Google BigQuery.

Job Command

gcloud dataflow jobs run kafka-test --gcs-location=<removed> --region=us-east1 --worker-zone=us-east4-a --parameters bootstrapServers=test-eh-namespace.servicebus.windows.net:9093,inputTopic=test-event-hub,outputTableSpec=project:Kafka_Test.test --service-account-email my-service-account.iam.gserviceaccount.com

Errors in GCP Dataflow

# these errors show up in the worker logs
Operation ongoing in step ReadFromKafka/KafkaIO.Read/Read(KafkaUnboundedSource)/DataflowRunner.StreamingUnboundedRead.ReadWithIds for at least 05m00s without outputting or completing in state process
  at java.lang.Thread.sleep(Native Method)
  at org.apache.kafka.common.utils.SystemTime.sleep(SystemTime.java:45)
  at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:366)
  at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1481)
  at com.google.cloud.teleport.kafka.connector.KafkaUnboundedSource.updatedSpecWithAssignedPartitions(KafkaUnboundedSource.java:85)
  at com.google.cloud.teleport.kafka.connector.KafkaUnboundedSource.createReader(KafkaUnboundedSource.java:125)
  at com.google.cloud.teleport.kafka.connector.KafkaUnboundedSource.createReader(KafkaUnboundedSource.java:45)
  at org.apache.beam.runners.dataflow.worker.WorkerCustomSources$UnboundedReader.iterator(WorkerCustomSources.java:433)
  at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:186)
  at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163)
  at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
  at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1426)
  at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:163)
  at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1105)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

Execution of work for computation 'S4' on key '0000000000000001' failed with uncaught exception. Work will be retried locally.

# this error shows up in the Job log
Error message from worker: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata

Updated Configuration

Map<String, Object> props = new HashMap<>();
// azure event hub authentication
props.put("sasl.mechanism", "PLAIN");
props.put("security.protocol", "SASL_SSL")
props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"<removed>\";");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

// https://github.com/Azure/azure-event-hubs-for-kafka/blob/master/CONFIGURATION.md
props.put("request.timeout.ms", 60000);
props.put("session.timeout.ms", 15000);
props.put("max.poll.interval.ms", 30000);
props.put("offset.metadata.max.bytes", 1024);
props.put("connections.max.idle.ms", 180000);
props.put("metadata.max.age.ms", 180000);

Pipeline

    PCollectionTuple convertedTableRows =
                pipeline
                        /*
                         * Step #1: Read messages in from Kafka
                         */
                        .apply(
                                "ReadFromKafka",
                                KafkaIO.<String, String>read()
                                        .withConsumerConfigUpdates(ImmutableMap.copyOf(props))
                                        .withBootstrapServers(options.getBootstrapServers())
                                        .withTopics(topicsList)
                                        .withKeyDeserializerAndCoder(
                                                StringDeserializer.class, NullableCoder.of(StringUtf8Coder.of()))
                                        .withValueDeserializerAndCoder(
                                                StringDeserializer.class, NullableCoder.of(StringUtf8Coder.of()))
                                        .withoutMetadata())

                        /*
                         * Step #2: Transform the Kafka Messages into TableRows
                         */
                        .apply("ConvertMessageToTableRow", new MessageToTableRow(options));


Overview

  1. Configure Environment Variables
  2. Modify, Build, & Upload to GCP's Container Registry
  3. Create a Dataflow Image Spec
  4. Execute the Image with Dataflow

This application has a complex build process that was ported over from the GCP Dataflow templates. The build process brings in the GCP Dataflow Docker image construction and deployment scripts as dependencies. Simply clone the repo to get started.
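Assuming the build was ported from the upstream GoogleCloudPlatform/DataflowTemplates repository (the Flex Templates, including kafka-to-bigquery, live under its v2 directory), getting started looks roughly like this:

# clone the upstream Dataflow templates and work from the Flex Template tree
git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git
cd DataflowTemplates/v2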

Prerequisites

Configure Environment Variables

The first step is to set up the environment variables that configure the build and deployment scripts for the given application.

export PROJECT=test-project
export IMAGE_NAME=test-project
export BUCKET_NAME=gs://test-project
export TARGET_GCR_IMAGE=gcr.io/${PROJECT}/${IMAGE_NAME}
export BASE_CONTAINER_IMAGE=gcr.io/dataflow-templates-base/java8-template-launcher-base
export BASE_CONTAINER_IMAGE_VERSION=latest
export TEMPLATE_MODULE=kafka-to-bigquery
export APP_ROOT=/template/${TEMPLATE_MODULE}
export COMMAND_SPEC=${APP_ROOT}/resources/${TEMPLATE_MODULE}-command-spec.json
export TEMPLATE_IMAGE_SPEC=${BUCKET_NAME}/images/${TEMPLATE_MODULE}-image-spec.json

export BOOTSTRAP=<event_hub_namespace>.servicebus.windows.net:9093
export TOPICS=<event_hub_topic_name>
export OUTPUT_TABLE=test-project:<schema>.test
export AUTHENTICATION_STRING="org.apache.kafka.common.security.plain.PlainLoginModule required username=\"\$ConnectionString\" password=\"<EVENT_HUB_CONNECTION_STRING>\";"
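If you still need the Event Hub connection string used as the Kafka password, one way to fetch it (assuming the Azure CLI and the default RootManageSharedAccessKey authorization rule) is:

# look up the namespace-level connection string (resource group and namespace are placeholders)
az eventhubs namespace authorization-rule keys list \
    --resource-group <resource_group> \
    --namespace-name <event_hub_namespace> \
    --name RootManageSharedAccessKey \
    --query primaryConnectionString -o tsv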

Modify, Build, & Upload Project

Before building, you will need to update the ./kafka-to-bigquery/src/main/java/com/google/cloud/teleport/v2/templates/KafkaToBigQuery.java file with the additional content to handle the authentication string:

public class KafkaToBigQuery {

    public interface Options extends PipelineOptions {

        // existing template options (bootstrapServers, inputTopics, outputTableSpec, ...) omitted for brevity

        @Description("Kafka Authentication String")
        @Required
        String getAuthenticationString();

        void setAuthenticationString(String authenticationString);
    }

    public static PipelineResult run(Options options) {

        // pipeline creation as in the original template
        Pipeline pipeline = Pipeline.create(options);

        // azure event hub authentication, passed in through the new pipeline option
        Map<String, Object> props = new HashMap<>();
        props.put("sasl.mechanism", "PLAIN");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", options.getAuthenticationString());

        // https://github.com/Azure/azure-event-hubs-for-kafka/blob/master/CONFIGURATION.md
        props.put("request.timeout.ms", 60000);
        props.put("session.timeout.ms", 15000);
        props.put("max.poll.interval.ms", 30000);
        props.put("offset.metadata.max.bytes", 1024);
        props.put("connections.max.idle.ms", 180000);
        props.put("metadata.max.age.ms", 180000);

        // topicsList comes from the template's existing inputTopics option
        PCollectionTuple convertedTableRows =
                pipeline
                        /*
                         * Step #1: Read messages in from Kafka
                         */
                        .apply(
                                "ReadFromKafka",
                                KafkaIO.<String, String>read()
                                        .withConsumerConfigUpdates(props)
                                        .withBootstrapServers(options.getBootstrapServers())
                                        .withTopics(topicsList)
                                        .withKeyDeserializerAndCoder(
                                                StringDeserializer.class, NullableCoder.of(StringUtf8Coder.of()))
                                        .withValueDeserializerAndCoder(
                                                StringDeserializer.class, NullableCoder.of(StringUtf8Coder.of()))
                                        .withoutMetadata())

                        /*
                         * Step #2: Transform the Kafka Messages into TableRows
                         */
                        .apply("ConvertMessageToTableRow", new MessageToTableRow(options));

        // ... remainder of the original template (writing rows to BigQuery, dead-letter handling) is unchanged ...

        return pipeline.run();
    }
}

Once you have set up the project and changed the file, the next phase is building the Docker image to upload to Google's Container Registry. This command also builds the common files that interact with miscellaneous Google services. If the build is successful, the container will be pushed into Google Container Registry (GCR). From GCR, you can deploy into Google Dataflow.

mvn clean package -Dimage=${TARGET_GCR_IMAGE} \
    -Dbase-container-image=${BASE_CONTAINER_IMAGE} \
    -Dbase-container-image.version=${BASE_CONTAINER_IMAGE_VERSION} \
    -Dapp-root=${APP_ROOT} \
    -Dcommand-spec=${COMMAND_SPEC} \
    -am -pl ${TEMPLATE_MODULE}
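Before moving on, it is worth confirming the image actually landed in GCR; something like the following (using the variables set earlier) should list it:

# verify the pushed container image and its tags
gcloud container images list --repository=gcr.io/${PROJECT}
gcloud container images list-tags ${TARGET_GCR_IMAGE}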

Create & Upload Image Spec (only done once)

Prior to launching the project in Dataflow, the Dataflow runner needs a Flex Template to know how to execute the project. The Flex Template is a JSON metadata file that contains the parameters and instructions to construct the GCP Dataflow application. The Flex Template must be uploaded to Google Cloud Storage (GCS) in the bucket set up by the environment variables, at a path matching TEMPLATE_IMAGE_SPEC=${BUCKET_NAME}/images/${TEMPLATE_MODULE}-image-spec.json.

{
  "image": "gcr.io/<my-project-url>:latest",
  "metadata": {
    "name": "Streaming data generator",
    "description": "Generates Synthetic data as per user specified schema at a fixed QPS and writes to Sink of user choice.",
    "parameters": [
      {
        "name": "authenticationString",
        "label": "Kafka Event Hub Authentication String",
        "helpText": "The authentication string for the Azure Event Hub",
        "is_optional": false,
        "regexes": [
          ".+"
        ],
        "paramType": "TEXT"
      },
      {
        "name": "bootstrapServers",
        "label": "Kafka Broker IP",
        "helpText": "The Kafka broker IP",
        "is_optional": false,
        "regexes": [
          ".+"
        ],
        "paramType": "TEXT"
      },
      {
        "name": "inputTopics",
        "label": "PubSub Topic name",
        "helpText": "The name of the topic to which the pipeline should publish data. For example, projects/<project-id>/topics/<topic-name> - should match the Event Grid Topic",
        "is_optional": false,
        "regexes": [
          ".+"
        ],
        "paramType": "PUBSUB_TOPIC"
      },
      {
        "name": "outputTableSpec",
        "label": "Output BigQuery table",
        "helpText": "Output BigQuery table. For example, <project>:<dataset>.<table_name>. Mandatory when sinkType is BIGQUERY.",
        "isOptional": false,
        "regexes": [
          ".+:.+\\..+"
        ],
        "paramType": "TEXT"
      },
      {
        "name": "outputDeadletterTable",
        "label": "Output Deadletter table",
        "helpText": "Output Deadletter table. For example, <project>:<dataset>.<table_name>",
        "isOptional": true,
        "regexes": [
          ".+:.+\\..+"
        ],
        "paramType": "TEXT"
      }
    ]
  },
  "sdk_info": {
    "language": "JAVA"
  }
}
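One way to upload the spec, assuming the JSON above is saved locally as kafka-to-bigquery-image-spec.json, is to copy it to the GCS path referenced by TEMPLATE_IMAGE_SPEC:

# copy the Flex Template image spec into the bucket the job will read it from
gsutil cp kafka-to-bigquery-image-spec.json ${TEMPLATE_IMAGE_SPEC}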

Execute the Image with Dataflow

Once you have an image uploaded to GCP and have uploaded the Flex Template, you can launch the Dataflow application. The parameters must match those included in the Flex Template's metadata section.

export JOB_NAME="${TEMPLATE_MODULE}-`date +%Y%m%d-%H%M%S-%N`"
gcloud beta dataflow flex-template run ${JOB_NAME} \
        --project=${PROJECT} --region=us-east1 \
        --template-file-gcs-location=${TEMPLATE_IMAGE_SPEC} \
        --parameters ^~^outputTableSpec=${OUTPUT_TABLE}~inputTopics=${TOPICS}~bootstrapServers=${BOOTSTRAP}~authenticationString="${AUTHENTICATION_STRING}" \
        --verbosity=info \
        --service-account-email=<service_account_to_execute_service>

Once you run this command, check the GCP Cloud Console to view the status. At this point, the Dataflow job should be successfully pulling messages from the Azure Event Hub and inserting them into Google BigQuery.
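If you prefer the CLI to the console, the same status can be checked with gcloud, using the job ID printed by the run command (the region must match the one used above):

# list active Dataflow jobs and inspect the one just launched
gcloud dataflow jobs list --region=us-east1 --status=active
gcloud dataflow jobs show <job_id> --region=us-east1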

The GCP repo assumes Google BigQuery/Dataflow will dynamically create the tables with the correct columns, but YMMV as I found this finicky. The workaround is to create the schema in Google BigQuery before running the Dataflow job.
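As an illustration only, a table matching the sample message shown earlier could be pre-created with the bq CLI; the table name follows the earlier examples and the field types are assumptions based on that JSON:

# pre-create the output table so the job does not rely on schema auto-creation
bq mk --table test-project:Kafka_Test.test \
    transaction_time:TIMESTAMP,first_name:STRING,last_name:STRING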
