
Apache Beam: DoFn vs PTransform

Both DoFn and PTransform are ways to define an operation on a PCollection. How do we know which one to use, and when?

A simple way to understand it is by analogy with map(f) for lists:

  • The higher-order function map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern.
  • The function f is the logic applied to each element.
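
To make the analogy concrete, here is a minimal sketch in plain Java (no Beam involved). The class name MapAnalogy and the helper map are illustrative names chosen for this example, not anything from a library.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapAnalogy {
    // The computational pattern: apply f to every element and collect the results.
    // It knows how to walk a list, but nothing about what f does.
    public static <A, B> List<B> map(Function<A, B> f, List<A> xs) {
        return xs.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The per-element logic: it knows how to square a number,
        // but nothing about lists.
        Function<Integer, Integer> square = x -> x * x;

        System.out.println(map(square, List.of(1, 2, 3))); // prints [1, 4, 9]
    }
}
```

The pattern (map) and the logic (square) are written separately and only meet at the call site, which is the same split the rest of this answer describes.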

Turning to Beam specifics, I think you are asking about ParDo.of(fn), which is a PTransform.

  • A PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns.
  • ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question.
  • The DoFn, which I called fn here, is the logic that is applied to each element.
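
The same division of labor can be sketched as a toy model in plain Java. This is not Beam's API: ElementFn and parDo below are hypothetical stand-ins for DoFn and ParDo, a List stands in for a PCollection, and the sequential loop is a simplification of what a real runner does in parallel. Unlike map, the per-element logic here may emit zero, one, or many outputs for each input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ParDoSketch {
    // Hypothetical stand-in for a DoFn: the logic for ONE element.
    // It emits any number of outputs through the receiver.
    public interface ElementFn<I, O> {
        void processElement(I element, Consumer<O> out);
    }

    // Hypothetical stand-in for ParDo: the pattern that feeds every
    // element of the input to fn and gathers whatever fn emits.
    // A real runner would distribute this work; here it is a plain loop.
    public static <I, O> List<O> parDo(ElementFn<I, O> fn, List<I> input) {
        List<O> output = new ArrayList<>();
        for (I element : input) {
            fn.processElement(element, output::add);
        }
        return output;
    }

    public static void main(String[] args) {
        // Per-element logic: split a line into words, skipping empty strings.
        ElementFn<String, String> splitWords = (line, out) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    out.accept(word);
                }
            }
        };

        // The empty line contributes no output at all.
        System.out.println(parDo(splitWords, List.of("to be", "", "or not")));
        // prints [to, be, or, not]
    }
}
```

In this sketch you would write splitWords yourself, while parDo is the part you get for free, which mirrors how you write a DoFn and hand it to ParDo.of(fn).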

It may also help to think of it this way: you write a DoFn to say what to do with each element, and the Beam runner provides the ParDo to apply your logic.


 