
How to manually copy executable to workers with Apache Beam Dataflow on GCP

Somewhat new to Beam and GCP. Following this document and using the Beam 'subprocess' examples, I've been working on a simple Java pipeline that runs a C binary. It runs fine with the DirectRunner, and I'm now trying to get it to run in the cloud. With the file staged in a GCS bucket, I get the error: 'Cannot run program "gs://mybucketname/tmp/grid_working_files/Echo": error=2, No such file or directory', which makes sense, since I guess you can't execute directly out of cloud storage? Where I'm stuck now is how to move the executable to the worker. The document states:

When you use a native Apache Beam language (Java or Python), the Beam SDK automatically moves all required code to the workers. However, when you make a call to external code, you need to move the code manually. To move the code, you do the following:

  1. Store the compiled external code, along with versioning information, in Cloud Storage.
  2. In the @Setup method, create a synchronized block to check whether the code file is available on the local resource. Rather than implementing a physical check, you can confirm availability using a static variable when the first thread finishes (steps 2-4 are sketched after this list).
  3. If the file isn't available, use the Cloud Storage client library to pull the file from the Cloud Storage bucket to the local worker. A recommended approach is to use the Beam FileSystems class for this task.
  4. After the file is moved, confirm that the execute bit is set on the code file.
  5. In a production system, check the hash of the binaries to ensure that the file has been copied correctly.
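
Here is my reading of steps 2-4 as a minimal Java DoFn sketch. The gs:// path is the one from my bucket; the class name, local path, and variable names are placeholders, not the actual 'subprocess' example code:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.transforms.DoFn;

public class EchoFn extends DoFn<String, String> {
  // Placeholder paths: the GCS object from my bucket and a local landing spot.
  private static final String SOURCE = "gs://mybucketname/tmp/grid_working_files/Echo";
  private static final Path LOCAL = Paths.get("/tmp/grid_working_files/Echo");

  // Step 2: a static flag lets later threads skip the copy once the
  // first @Setup call on this worker JVM has finished.
  private static volatile boolean staged = false;
  private static final Object LOCK = new Object();

  @Setup
  public void setup() throws IOException {
    synchronized (LOCK) {
      if (staged) {
        return;
      }
      Files.createDirectories(LOCAL.getParent());
      // Step 3: pull the binary from Cloud Storage via Beam's FileSystems
      // class (already configured with pipeline options on Dataflow workers).
      try (ReadableByteChannel in =
          FileSystems.open(FileSystems.matchNewResource(SOURCE, /* isDirectory= */ false))) {
        Files.copy(Channels.newInputStream(in), LOCAL, StandardCopyOption.REPLACE_EXISTING);
      }
      // Step 4: set the execute bit (Dataflow workers run Linux).
      Files.setPosixFilePermissions(
          LOCAL,
          EnumSet.of(
              PosixFilePermission.OWNER_READ,
              PosixFilePermission.OWNER_WRITE,
              PosixFilePermission.OWNER_EXECUTE));
      staged = true;
    }
  }

  @ProcessElement
  public void processElement(@Element String input, OutputReceiver<String> out) throws Exception {
    // Run the now-local binary; a gs:// path would fail here with error=2.
    Process p = new ProcessBuilder(LOCAL.toString(), input).redirectErrorStream(true).start();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
      String line;
      while ((line = r.readLine()) != null) {
        out.output(line);
      }
    }
    if (p.waitFor() != 0) {
      throw new IOException("Echo exited with non-zero status");
    }
  }
}
```

I've left out step 5's hash check; in production you could compare, say, a SHA-256 of the local file against a digest stored alongside the object.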

I've looked at the FileSystems class, and I think I understand it, but what I don't know is where I need to copy the files to. Is there a known directory or filepath that the workers use? I'm using the Dataflow runner.

You can copy the file to wherever you want in your worker's local filesystem; e.g., you could use the tempfile module to create a new, empty temporary directory in which to copy your executable before running.
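
Since the question's pipeline is in Java, the rough counterpart of Python's tempfile module is java.nio.file.Files.createTempDirectory; a minimal sketch (the directory prefix and file name are arbitrary):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirExample {
  public static void main(String[] args) throws IOException {
    // Fresh, empty scratch directory on the worker's local disk;
    // the "grid-binaries" prefix is arbitrary.
    Path workDir = Files.createTempDirectory("grid-binaries");
    Path localBinary = workDir.resolve("Echo");
    System.out.println("Stage the executable at: " + localBinary);
    // Copy the gs:// object here (e.g. with FileSystems as in the sketch
    // above), set the execute bit, then pass localBinary.toString()
    // to ProcessBuilder.
  }
}
```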

Using custom containers might be a good solution to this as well.

