
How to trigger a Cloud Dataflow pipeline job from a Cloud Function in Java?

I have a requirement to trigger a Cloud Dataflow pipeline from Cloud Functions, but the Cloud Function must be written in Java. The trigger for the Cloud Function is Google Cloud Storage's Finalize/Create event, i.e., when a file is uploaded to a GCS bucket, the Cloud Function must trigger the Cloud Dataflow pipeline.

When I create a Dataflow pipeline (batch) and execute it, it creates a Dataflow pipeline template and creates a Dataflow job.
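
For context, a classic (batch) template is typically staged by running the Beam pipeline once with a template location set. Below is a minimal sketch, assuming a standard Beam pipeline with DataflowPipelineOptions; the class name is illustrative, and the project ID and bucket paths mirror the ones used in the question's function:

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StageTemplate {
    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setProject("my-project-id");
        options.setTempLocation("gs://my-dataflow-job-bucket/tmp");
        // Setting templateLocation makes this run stage a template instead of executing the job.
        options.setTemplateLocation("gs://my-dataflow-job-bucket-staging/templates/cloud-dataflow-template");

        Pipeline pipeline = Pipeline.create(options);
        // ... add the pipeline's transforms here ...
        pipeline.run();
    }
}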

But when I create the Cloud Function in Java and upload a file, the function's status just says "ok", and it does not trigger the Dataflow pipeline.

Cloud Function

package com.example;

import com.example.Example.GCSEvent;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.HttpRequestInitializer;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.CreateJobFromTemplateRequest;
import com.google.api.services.dataflow.model.RuntimeEnvironment;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;

import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.HashMap;
import java.util.logging.Logger;

public class Example implements BackgroundFunction<GCSEvent> {
    private static final Logger logger = Logger.getLogger(Example.class.getName());

    @Override
    public void accept(GCSEvent event, Context context) throws IOException, GeneralSecurityException {
        logger.info("Event: " + context.eventId());
        logger.info("Event Type: " + context.eventType());


        HttpTransport httpTransport = GoogleNetHttpTransport.newTrustedTransport();
        JsonFactory jsonFactory = JacksonFactory.getDefaultInstance();

        GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
        HttpRequestInitializer requestInitializer = new HttpCredentialsAdapter(credentials);


        Dataflow dataflowService = new Dataflow.Builder(httpTransport, jsonFactory, requestInitializer)
                .setApplicationName("Google Dataflow function Demo")
                .build();

        String projectId = "my-project-id";


        RuntimeEnvironment runtimeEnvironment = new RuntimeEnvironment();
        runtimeEnvironment.setBypassTempDirValidation(false);
        runtimeEnvironment.setTempLocation("gs://my-dataflow-job-bucket/tmp");
        CreateJobFromTemplateRequest createJobFromTemplateRequest = new CreateJobFromTemplateRequest();
        createJobFromTemplateRequest.setEnvironment(runtimeEnvironment);
        createJobFromTemplateRequest.setLocation("us-central1");
        createJobFromTemplateRequest.setGcsPath("gs://my-dataflow-job-bucket-staging/templates/cloud-dataflow-template");
        createJobFromTemplateRequest.setJobName("Dataflow-Cloud-Job");
        createJobFromTemplateRequest.setParameters(new HashMap<String,String>());
        createJobFromTemplateRequest.getParameters().put("inputFile","gs://cloud-dataflow-bucket-input/*.txt");
        dataflowService.projects().templates().create(projectId,createJobFromTemplateRequest);

        throw new UnsupportedOperationException("Not supported yet.");
    }

    public static class GCSEvent {
        String bucket;
        String name;
        String metageneration;
    }


}

pom.xml (Cloud Function)

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>cloudfunctions</groupId>
  <artifactId>http-function</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.target>11</maven.compiler.target>
    <maven.compiler.source>11</maven.compiler.source>
  </properties>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/com.google.auth/google-auth-library-credentials -->
    <dependency>
      <groupId>com.google.auth</groupId>
      <artifactId>google-auth-library-credentials</artifactId>
      <version>0.21.1</version>
    </dependency>
    <dependency>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-dataflow</artifactId>
      <version>v1b3-rev207-1.20.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud.functions</groupId>
      <artifactId>functions-framework-api</artifactId>
      <version>1.0.1</version>
    </dependency>
    <dependency>
      <groupId>com.google.auth</groupId>
      <artifactId>google-auth-library-oauth2-http</artifactId>
      <version>0.21.1</version>
    </dependency>
  </dependencies>

  <!-- Required for Java 11 functions in the inline editor -->
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <excludes>
            <exclude>.google/</exclude>
          </excludes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Cloud Function logs

(screenshot of the Cloud Function execution logs)

I went through the blogs below (added for reference), where a Dataflow pipeline is triggered from Cloud Storage via a Cloud Function, but the code is written in either Node.js or Python. My Cloud Function must be written in Java.

Triggering a Dataflow pipeline via Cloud Functions in Node.js:

https://dzone.com/articles/triggering-dataflow-pipelines-with-cloud-functions

Triggering a Dataflow pipeline via Cloud Functions using Python:

https://medium.com/google-cloud/how-to-kick-off-a-dataflow-pipeline-via-cloud-functions-696927975d4e

Any help on this is very much appreciated.

// Configure the runtime environment for the template launch.
RuntimeEnvironment runtimeEnvironment = new RuntimeEnvironment();
runtimeEnvironment.setBypassTempDirValidation(false);
runtimeEnvironment.setTempLocation("gs://karthiksfirstbucket/temp1");

// Build the launch parameters: job name plus the template's runtime parameters.
LaunchTemplateParameters launchTemplateParameters = new LaunchTemplateParameters();
launchTemplateParameters.setEnvironment(runtimeEnvironment);
launchTemplateParameters.setJobName("newJob" + (new Date()).getTime());

Map<String, String> params = new HashMap<String, String>();
params.put("inputFile", "gs://karthiksfirstbucket/sample.txt");
params.put("output", "gs://karthiksfirstbucket/count1");
launchTemplateParameters.setParameters(params);
writer.write("4"); // debug output to the HTTP response writer in the full example linked below

// Launch the public Word_Count template; execute() is what actually submits the job.
Dataflow.Projects.Templates.Launch launch = dataflowService.projects().templates()
        .launch(projectId, launchTemplateParameters);
launch.setGcsPath("gs://dataflow-templates-us-central1/latest/Word_Count");
launch.execute();

The above code launches a template and executes the Dataflow pipeline. It:

  1. uses application default credentials (which can be changed to user or service-account credentials; see the sketch after this list);
  2. uses the default region (which can be changed);
  3. creates a job for every HTTP trigger (the trigger can be changed).
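
For point 1, here is a minimal sketch of swapping application default credentials for an explicit service-account key; the class name and key-file path are illustrative, not from the original answer:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Collections;

import com.google.api.client.http.HttpRequestInitializer;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.auth.oauth2.ServiceAccountCredentials;

public class ServiceAccountInitializer {
    // Builds an HttpRequestInitializer from a service-account key file instead of
    // application default credentials. The key path is supplied by the caller.
    static HttpRequestInitializer fromKeyFile(String keyPath) throws IOException {
        try (FileInputStream keyStream = new FileInputStream(keyPath)) {
            GoogleCredentials credentials = ServiceAccountCredentials.fromStream(keyStream)
                    .createScoped(Collections.singletonList("https://www.googleapis.com/auth/cloud-platform"));
            return new HttpCredentialsAdapter(credentials);
        }
    }
}

The returned initializer can be passed to the Dataflow.Builder in place of the adapter built from GoogleCredentials.getApplicationDefault().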

The complete code can be found below:

https://github.com/karthikeyan1127/Java_CloudFunction_DataFlow/blob/master/Hello.java

This is my solution using the newer Dataflow dependencies:

// Imports assumed by this snippet (my best guess for the newer Dataflow client and
// Functions Framework artifacts; verify the packages against the versions you use).
// GoogleApacheHttpTransport, RetryHttpRequestInitializer, ChainingHttpRequestInitializer
// and ImmutableList also need to be imported; their packages depend on which HTTP
// transport / GCP utility libraries are on the classpath, so they are omitted here.
import com.google.api.client.http.HttpRequestInitializer;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.Dataflow.Projects.Locations.FlexTemplates.Launch;
import com.google.api.services.dataflow.model.FlexTemplateRuntimeEnvironment;
import com.google.api.services.dataflow.model.LaunchFlexTemplateParameter;
import com.google.api.services.dataflow.model.LaunchFlexTemplateRequest;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.logging.Logger;

public class Example implements BackgroundFunction<Example.GCSEvent> {
    private static final Logger logger = Logger.getLogger(Example.class.getName());

    @Override
    public void accept(GCSEvent event, Context context) throws Exception {
        String filename = event.name;

        logger.info("Processing file: " + filename);
        logger.info("Bucket name" + event.bucket);

        String projectId = "cedar-router-268801";
        String region = "us-central1";
        String tempLocation = "gs://cedar-router-beam-poc/temp";
        String templateLocation = "gs://cedar-router-beam-poc/template/poc-template.json";

        logger.info("path" + String.format("gs://%s/%s", event.bucket, filename));
        String scenario = filename.substring(0, 3); //it comes TWO OR ONE

        logger.info("scneario " + scenario);

        Map<String, String> params = Map.of("sourceFile", String.format("%s/%s", event.bucket, filename),
                "scenario", scenario,
                "years", "2013,2014",
                "targetFile", "gs://cedar-router-beam-poc-kms/result/testfile");
        
        extractedJob(projectId, region, tempLocation, templateLocation, params);
    }

    private static void extractedJob(String projectId,
                                     String region,
                                     String tempLocation,
                                     String templateLocation,
                                     Map<String, String> params) throws Exception {

        HttpTransport httpTransport = GoogleApacheHttpTransport.newTrustedTransport();
        JsonFactory jsonFactory = GsonFactory.getDefaultInstance();
        GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
        HttpRequestInitializer httpRequestInitializer = new RetryHttpRequestInitializer(ImmutableList.of(404));
        ChainingHttpRequestInitializer chainingHttpRequestInitializer =
                new ChainingHttpRequestInitializer(new HttpCredentialsAdapter(credentials), httpRequestInitializer);

        Dataflow dataflowService = new Dataflow.Builder(httpTransport, jsonFactory, chainingHttpRequestInitializer)
                .setApplicationName("Dataflow from Cloud function")
                .build();

        FlexTemplateRuntimeEnvironment runtimeEnvironment = new FlexTemplateRuntimeEnvironment();
        runtimeEnvironment.setTempLocation(tempLocation);

        LaunchFlexTemplateParameter launchFlexTemplateParameter = new LaunchFlexTemplateParameter();
        launchFlexTemplateParameter.setEnvironment(runtimeEnvironment);
        // Note: this jobName derived from the source file path is logged but not used;
        // the job launched below is named with a timestamp instead.
        String jobName = params.get("sourceFile").substring(34, 49).replace("_", "");
        logger.info("job name: " + jobName);
        launchFlexTemplateParameter.setJobName("job" + LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss")));

        launchFlexTemplateParameter.setContainerSpecGcsPath(templateLocation);
        launchFlexTemplateParameter.setParameters(params);

        LaunchFlexTemplateRequest launchFlexTemplateRequest = new LaunchFlexTemplateRequest();
        launchFlexTemplateRequest.setLaunchParameter(launchFlexTemplateParameter);


        Launch launch = dataflowService.projects()
                .locations()
                .flexTemplates()
                .launch(projectId, region, launchFlexTemplateRequest);

        launch.execute();
        logger.info("running job");
    }


    public static class GCSEvent {
        String bucket;
        String name;
        String metageneration;
    }
}

Just adapt it to your case.
