
Apache Beam: What is the difference between DoFn and SimpleFunction?

While reading about processing streaming elements in Apache Beam using Java, I came across DoFn<InputT, OutputT> and then SimpleFunction<InputT, OutputT>.

Both of these look similar to me and I find it difficult to understand the difference.

Can someone explain the difference in layman's terms?

Conceptually, you can think of SimpleFunction as a simple case of DoFn:

  • SimpleFunction<InputT, OutputT>:

    • simple input to output mapping function;
    • single input produces single output;
    • statically typed, you have to @Override the apply() method;
    • doesn't depend on computation context;
    • can't use Beam state APIs;
    • example use case: MapElements.via(simpleFunction) to convert/modify elements one by one, producing one output for each element;
  • DoFn<InputT, OutputT>:

    • executed with ParDo;
    • has access to the processing context (timestamp, window, pane, etc.);
    • can consume side inputs;
    • can produce multiple outputs or no outputs at all;
    • can produce side outputs (called additional outputs in current Beam versions, tagged with TupleTag);
    • can use Beam's persistent state APIs;
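    • dynamically typed: the processing method is found through the @ProcessElement annotation rather than by overriding a method with a fixed signature;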
    • example use case: read objects from a stream, filter, accumulate them, perform aggregations, convert them, and dispatch to different outputs;
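
To make the contrast concrete, here is a minimal sketch of the two side by side. The class names WordLength and SplitWords (and the wrapper class) are made up for illustration, and it assumes a Beam 2.x Java SDK recent enough to support the @Element and OutputReceiver parameters.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.SimpleFunction;

public class DoFnVsSimpleFunction {

  // SimpleFunction: a plain one-to-one mapping. You override apply(),
  // and every input element yields exactly one output element.
  // (WordLength is a hypothetical name used only for this sketch.)
  public static class WordLength extends SimpleFunction<String, Integer> {
    @Override
    public Integer apply(String word) {
      return word.length();
    }
  }

  // DoFn: the method annotated with @ProcessElement may emit zero, one,
  // or many outputs for each input element.
  // (SplitWords is a hypothetical name used only for this sketch.)
  public static class SplitWords extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      for (String word : line.split("\\s+")) {
        if (!word.isEmpty()) {
          out.output(word); // a blank line produces no output at all
        }
      }
    }
  }
}
```

In a pipeline, the first would typically be applied as MapElements.via(new WordLength()) and the second as ParDo.of(new SplitWords()).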

You can find more specific examples and use cases for ParDo in the dev guide.

This part of the guide covers MapElements, which is the typical use case for SimpleFunction.
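
As a rough illustration of that MapElements use case, the sketch below wires an anonymous SimpleFunction into a small pipeline. The class name MapElementsExample and the helper upperCase are hypothetical, and running main() assumes a runner such as the Direct Runner is available on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class MapElementsExample {

  // Hypothetical helper: maps each input string to its upper-case form,
  // one output per input, using MapElements with a SimpleFunction.
  public static PCollection<String> upperCase(PCollection<String> words) {
    return words.apply(
        "UpperCase",
        MapElements.via(
            new SimpleFunction<String, String>() {
              @Override
              public String apply(String word) {
                return word.toUpperCase();
              }
            }));
  }

  public static void main(String[] args) {
    // Assumption: default options pick up a runner from the classpath
    // (for example the Direct Runner for local execution).
    Pipeline pipeline = Pipeline.create();
    PCollection<String> words = pipeline.apply(Create.of("apache", "beam"));
    upperCase(words);
    pipeline.run().waitUntilFinish();
  }
}
```

Because SimpleFunction carries its input and output types in the class signature, MapElements.via can usually infer the output type without an explicit type descriptor.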
