
Read from dynamic GCS bucket partitioned by date using Apache Beam and Dataflow

Right now I'm sending failed records from an API call to a GCS bucket that's partitioned by day.

For example:

gs://path/to/file/2022/07/01

gs://path/to/file/2022/07/02

etc.

From there, I would like to schedule a batch job to retry these failed records the next day using Apache Beam and Dataflow. The issue is that the date at the end of the GCS path is only added when the initial template is uploaded to GCP, and it remains fixed at that date regardless of when you run the job (similar to the behaviour described in other questions). I'm using a ValueProvider but I cannot figure out a workaround for this.

I've found that when I run the pipeline locally, everything works great, but in Dataflow, only the DoFns within the expand function actually get run.

Essentially what I'm doing here is:

  1. Getting the initial GCS path and appending the current date minus 24 hours to the end in fileMatch.

  2. Calling FileIO.readMatches() to convert each match() to a ReadableFile.

  3. Calling MatchGCSFiles, which contains the exact code below that uses the value providers to get the current date and append it to the GCS path. I did this to override the existing path because it was the only way I could get it to work; as far as I understand, a DoFn cannot take in empty input (still learning Beam).

  4. Calling FileIO.readMatches() again to convert the new match() to a new ReadableFile, then calling the API.

     String dateFormat = "yyyy/MM/dd";

     ValueProvider<String> date = new ValueProvider<String>() {
         @Override
         public String get() {
             String currentDate = Instant.now().minus(86400000).toDateTime(DateTimeZone.UTC).toString(dateFormat);
             return currentDate;
         }

         @Override
         public boolean isAccessible() {
             return true;
         }
     };

     ValueProvider<String> gcsPathWithDate = new ValueProvider<String>() {
         @Override
         public String get() {
             return String.format("%s/%s/*/*.json", gcsPathPrefix, date.get());
         }

         @Override
         public boolean isAccessible() {
             return true;
         }
     };

     fileMatch = FileIO.match().filepattern(gcsPathWithDate.get());
     }

     PCollectionTuple mixedPColl = input
         .getPipeline()
         .apply("File match", fileMatch)
         .apply("applying read matches", FileIO.readMatches())
         .apply("matching files", ParDo.of(new MatchGCSFiles()))
         .apply("applying read matches", FileIO.readMatches()) // problem here
         .apply("Read failed events from GCS", ParDo.of(new ReadFromGCS()))
         .apply(/* call API */)...

The problem is in the second FileIO.readMatches(): the return type does not match. The compiler reports "reason: no instance(s) of type variable(s) exist so that PCollection conforms to PCollection".

I've tried different workarounds for this but none seem to work.

Is there another/better way to dynamically add/replace the date in the GCS path? I'm still learning Beam so if I'm doing something wrong please let me know.

Thanks in advance.

It seems that there are a few improvements you can apply:

  • Your pipeline always reads the files from yesterday, so you don't need a runtime parameter. (The usage of ValueProvider in your example is also not right, since it isn't used as a runtime parameter supplied through a pipeline option.)

    • If you do need to set an arbitrary date, you'll have to follow the example to create a pipeline option.
     public interface MyOptions extends PipelineOptions {
         @Description("Date in yyyy-MM-dd")
         @Default.String("2022-01-01")
         ValueProvider<String> getDateValue();

         void setDateValue(ValueProvider<String> value);
     }

    Then use it in your DoFns.

     ...
     @ProcessElement
     public void process(ProcessContext c) {
         MyOptions ops = c.getPipelineOptions().as(MyOptions.class);
         // Use it.
         ...(ops.getDateValue())
     }
     ...

    When you start a job, you have to provide that arbitrary date as a pipeline option, just as you would set any other PipelineOptions.

  • You can always get yesterday's date as described in "Get yesterday's date using Date":

     Instant now = Instant.now();
     Instant yesterday = now.minus(1, ChronoUnit.DAYS);

    and then use it directly in your pipeline.
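
To make the last point concrete, here is a minimal sketch of building the file pattern for yesterday's partition with plain java.time. The class FailedRecordPaths and its two methods are hypothetical helpers, not part of Beam. The prefix gs://path/to/file, the yyyy/MM/dd layout and the trailing /*/*.json come from the question. Passing a Clock is an assumption made only so the date logic can be checked with a fixed time.

```java
import java.time.Clock;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper (not a Beam class): builds the glob for one day's partition. */
final class FailedRecordPaths {
    // Partition layout used by the bucket in the question: yyyy/MM/dd.
    private static final DateTimeFormatter PARTITION = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    private FailedRecordPaths() {}

    /** Glob for the partition one calendar day before the clock's current UTC date. */
    static String yesterdayPattern(String gcsPathPrefix, Clock clock) {
        LocalDate yesterday = LocalDate.now(clock.withZone(ZoneOffset.UTC)).minusDays(1);
        return patternFor(gcsPathPrefix, yesterday);
    }

    /** Glob for an explicit date, e.g. one parsed from a "yyyy-MM-dd" pipeline option. */
    static String patternFor(String gcsPathPrefix, LocalDate date) {
        return String.format("%s/%s/*/*.json", gcsPathPrefix, date.format(PARTITION));
    }

    public static void main(String[] args) {
        // Evaluated when this line runs, so the result depends on the current UTC date.
        System.out.println(yesterdayPattern("gs://path/to/file", Clock.systemUTC()));
        // Explicit date, in the format used by the pipeline option's default value.
        System.out.println(patternFor("gs://path/to/file", LocalDate.parse("2022-01-01")));
    }
}
```

The date is fixed at whatever moment yesterdayPattern is called. In the question's code, gcsPathWithDate.get() is called while the pipeline is being constructed, which would explain why a staged template keeps the date from the day it was built. For a template, the call would therefore have to happen at run time, for example inside a DoFn as suggested above.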
