How can I dynamically add a field in MapElements in apache_beam?

I would appreciate it if someone could help me write Java code for apache_beam (2.13.0).

In Python, you can dynamically add a field using the one-to-one mapping of the Map function.

Code

#!/usr/bin/env python

import apache_beam as beam
from apache_beam.io.textio import WriteToText

def addoutput(line):
    return [line, "Its weekend!"]

with beam.Pipeline() as p:
    ( p
      | beam.Create(["blah"])
      | beam.Map(addoutput)
      | WriteToText(file_path_prefix="/tmp/sample")
    )

Result

['blah', 'Its weekend!']

However, when I try to do the same thing in Java, I get a compile error from Maven.

Code

public class SampleTextIO
{
    static class AddFieldFn extends DoFn<String, String> {

        @ProcessElement
        public void processElement(@Element String word, OutputReceiver<String> receiver) {

            receiver.output(word);
            receiver.output("Its weekend!");
        }
    }

    public static void main ( String[] args ) {
        System.out.println( "Main class for DirectRunner" );

        // Create the pipeline using the default runner (DirectRunner)
        // Interface: PipelineOptions
        PipelineOptions options = PipelineOptionsFactory.create();

        Pipeline p = Pipeline.create(options);

        // Example pcollection
        final List<String> LINES = Arrays.asList(
            "blah"
        );

        // Build the pipeline from the in-memory list and write the result
        p.apply(Create.of(LINES))
         .apply(MapElements.via(new AddFieldFn()))
         .apply(TextIO.write().to("/tmp/test-out"));

        p.run().waitUntilFinish();
    }
}

Result

[ERROR] /home/ywatanabe/git/google-data-engineer/Data_Science_on_the_Google_Cloud_Platform/Ch04/java/directrunner/src/main/java/com/example/SampleTextIO.java:[43,28] no suitable method found for via(com.example.SampleTextIO.AddFieldFn)
[ERROR]     method org.apache.beam.sdk.transforms.MapElements.<InputT,OutputT>via(org.apache.beam.sdk.transforms.InferableFunction<InputT,OutputT>) is not applicable
[ERROR]       (cannot infer type-variable(s) InputT,OutputT
[ERROR]         (argument mismatch; com.example.SampleTextIO.AddFieldFn cannot be converted to org.apache.beam.sdk.transforms.InferableFunction<InputT,OutputT>))
[ERROR]     method org.apache.beam.sdk.transforms.MapElements.<InputT,OutputT>via(org.apache.beam.sdk.transforms.SimpleFunction<InputT,OutputT>) is not applicable
[ERROR]       (cannot infer type-variable(s) InputT,OutputT
[ERROR]         (argument mismatch; com.example.SampleTextIO.AddFieldFn cannot be converted to org.apache.beam.sdk.transforms.SimpleFunction<InputT,OutputT>))
[ERROR]     method org.apache.beam.sdk.transforms.MapElements.via(org.apache.beam.sdk.transforms.ProcessFunction) is not applicable
[ERROR]       (argument mismatch; com.example.SampleTextIO.AddFieldFn cannot be converted to org.apache.beam.sdk.transforms.ProcessFunction)
[ERROR]     method org.apache.beam.sdk.transforms.MapElements.via(org.apache.beam.sdk.transforms.SerializableFunction) is not applicable
[ERROR]       (argument mismatch; com.example.SampleTextIO.AddFieldFn cannot be converted to org.apache.beam.sdk.transforms.SerializableFunction)
[ERROR]     method org.apache.beam.sdk.transforms.MapElements.via(org.apache.beam.sdk.transforms.Contextful) is not applicable
[ERROR]       (argument mismatch; com.example.SampleTextIO.AddFieldFn cannot be converted to org.apache.beam.sdk.transforms.Contextful)

According to the javadoc, MapElements supports ProcessFunction, but that does not work in my case.

How can I dynamically add fields in Java the way I do in Python?

This is because the via method of MapElements expects one of the following: InferableFunction, SimpleFunction, ProcessFunction, SerializableFunction, or Contextful. In your example, AddFieldFn extends DoFn instead, so none of the overloads apply. Also, comparing with the Python example, it seems that you want to output a single list of two elements rather than yield two separate rows.
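The difference between one list per input and two rows per input can be illustrated without Beam. The following is only an analogy using plain java.util.stream, not Beam code: map stands in for a one-to-one mapping such as the Python Map, and flatMap stands in for calling receiver.output twice. The class and method names (MapVersusEmit, oneListPerInput, twoRowsPerInput) are made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MapVersusEmit {
    // One output per input: each word becomes a single two-item list,
    // which is what the Python addoutput function returns.
    static List<List<String>> oneListPerInput(List<String> lines) {
        return lines.stream()
                .map(word -> Arrays.asList(word, "Its weekend!"))
                .collect(Collectors.toList());
    }

    // Two outputs per input: each word yields two separate elements,
    // comparable to calling receiver.output(...) twice for one input.
    static List<String> twoRowsPerInput(List<String> lines) {
        return lines.stream()
                .flatMap(word -> Stream.of(word, "Its weekend!"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("blah");
        System.out.println(oneListPerInput(lines)); // [[blah, Its weekend!]]
        System.out.println(twoRowsPerInput(lines)); // [blah, Its weekend!]
    }
}
```

The first result holds one element that is itself a list; the second holds two independent string elements.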

Here are three ways to produce that list with MapElements:

// via ProcessFunction (a lambda, with the output type given by into)
PCollection<Void> p1 = p.apply(Create.of(LINES))
  .apply(MapElements.into(TypeDescriptors.lists(TypeDescriptors.strings()))
                    .via((String word) -> Arrays.asList(word, "Its weekend!")))
  .apply(ParDo.of(new PrintResultsFn()));

// via in-line SimpleFunction
PCollection<Void> p2 = p.apply(Create.of(LINES))
  .apply(MapElements.via(new SimpleFunction<String, List<String>>() {
    @Override
    public List<String> apply(String word) {
      return Arrays.asList(word, "Its weekend!");
    }}))
  .apply(ParDo.of(new PrintResultsFn()));

// via AddFieldFn class
PCollection<Void> p3 = p.apply(Create.of(LINES))
  .apply(MapElements.via(new AddFieldFn()))
  .apply(ParDo.of(new PrintResultsFn()));

where AddFieldFn is:

// define AddFieldFn extending from SimpleFunction and overriding apply method
static class AddFieldFn extends SimpleFunction<String, List<String>> {
    @Override
    public List<String> apply(String word) {
        return Arrays.asList(word, "Its weekend!");
    }
}

and PrintResultsFn prints each row so the result can be verified:

// just print the results
// (Log is assumed to be a logger field declared in the enclosing class)
static class PrintResultsFn extends DoFn<List<String>, Void> {
    @ProcessElement
    public void processElement(@Element List<String> words) {
        Log.info(Arrays.toString(words.toArray()));
    }
}

This should print the desired output once for each of the three examples:

Jun 23, 2019 8:00:03 PM com.dataflow.samples.SampleTextIO$PrintResultsFn processElement
INFO: [blah, Its weekend!]
Jun 23, 2019 8:00:03 PM com.dataflow.samples.SampleTextIO$PrintResultsFn processElement
INFO: [blah, Its weekend!]
Jun 23, 2019 8:00:03 PM com.dataflow.samples.SampleTextIO$PrintResultsFn processElement
INFO: [blah, Its weekend!]
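The examples above log the lists instead of writing them to a file. TextIO.write() takes a PCollection of String, so to write a text file as in the question, each List<String> would first have to be turned into a single line. A minimal sketch of such a formatter in plain Java follows; ListLineFormatter and toPythonStyle are hypothetical names, and the helper does no escaping of quotes inside the fields.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ListLineFormatter {
    // Hypothetical helper: renders a list the way the Python output looks,
    // with single quotes around each field. Quotes inside a field are
    // NOT escaped, so this is only suitable for simple values.
    static String toPythonStyle(List<String> fields) {
        return fields.stream()
                .map(field -> "'" + field + "'")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        // Prints: ['blah', 'Its weekend!']
        System.out.println(toPythonStyle(Arrays.asList("blah", "Its weekend!")));
    }
}
```

In a pipeline, a helper like this could presumably be applied in one more MapElements step, with into(TypeDescriptors.strings()), placed before TextIO.write(); that wiring is an assumption and is not part of the tested code.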

Full code here. Tested with DirectRunner and Java SDK 2.13.0.
