
What is the correct way to organize the PTransforms in a Beam pipeline?

I'm developing a pipeline that reads data from Kafka.

The source Kafka topic carries a lot of traffic: about 10k messages are produced per second, and each message is around 200 kB.

I need to filter the data before applying my transformations, but what I'm not sure about is whether there is a preferred order in which to apply the filter and window functions. Would

read->window->filter->transform->write

be more efficient than

read->filter->window->transform->write

or would both options perform the same?
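
For concreteness, here is roughly how the first ordering could be written with the Beam Python SDK; the broker address, topic name, window size, filter predicate, transform, and sink below are all illustrative placeholders rather than details from the question:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.transforms import window

# Sketch of read -> window -> filter -> transform -> write.
# Broker, topic, predicate, transform, and sink are placeholders.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["events"])
        | "Window" >> beam.WindowInto(window.FixedWindows(60))    # 60-second fixed windows
        | "Filter" >> beam.Filter(lambda kv: kv[1] is not None)   # drop unwanted records
        | "Transform" >> beam.Map(lambda kv: kv)                   # stand-in for the real logic
        | "Write" >> beam.Map(print)                                # stand-in for the real sink
    )
```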

I know that Beam is just a programming model that describes the what and not the how, and that the runner optimizes the pipeline, but I just want to be sure I've got it right.

Thanks

If there is substantial filtering, windowing after the filter will technically reduce the amount of work performed, though that saved work is cheap enough that I doubt it'd make a measurable difference. (Presumably the runner could notice that the filter does not observe the assigned window and lift it in that case, but as mentioned it's unclear if there are really savings to be gained here...)
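
In Beam terms, an element-wise `beam.Filter` never inspects the window an element was assigned to, so moving it ahead of `WindowInto` yields the same output while assigning windows to fewer elements. A minimal runnable sketch of that ordering, using `beam.Create` in place of the Kafka source (all values are made up):

```python
import apache_beam as beam
from apache_beam.transforms import window

# Sketch of filter -> window: elements are dropped before windows are assigned.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create([("k1", b"payload"), ("k2", None)])  # stand-in for the Kafka read
        | "Filter" >> beam.Filter(lambda kv: kv[1] is not None)      # element-wise, window-agnostic
        | "Window" >> beam.WindowInto(window.FixedWindows(60))       # 60-second fixed windows
        | "Transform" >> beam.Map(lambda kv: kv)
        | "Write" >> beam.Map(print)
    )
```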

