
Apache Beam Wall Time keeps increasing

I have a Beam pipeline that reads from a Pub/Sub topic, applies some small transformations, and then writes the events to some BigQuery tables.

The transforms are light on processing, maybe removing a field or something similar, but, as you can see from the image below, the Wall Time is very high for some steps. What could be causing this?

Every element is a tuple of the form ((str, str, str), {**dict with data}). Using this key, we do a naive deduplication by keeping only the latest event per key. Basically, whatever I add after the Get latest element per key step is slow, and the tagging step is also slow, even though it just adds a tag to each element.
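To make the dedup step concrete, here is a minimal, hypothetical sketch of "keep the latest event per key" in plain Python (not the asker's actual Beam code). The field name "timestamp" is an assumption for illustration; in Beam this logic would typically sit behind a GroupByKey or a combiner.

```python
# Hypothetical sketch: elements are ((str, str, str), {...}) tuples;
# for each key we keep the event with the greatest "timestamp" field.
def latest_per_key(elements):
    latest = {}
    for key, event in elements:
        # Keep this event if it's the first seen for the key,
        # or newer than the one we currently hold.
        if key not in latest or event["timestamp"] > latest[key]["timestamp"]:
            latest[key] = event
    return latest

events = [
    (("a", "b", "c"), {"timestamp": 1, "v": "old"}),
    (("a", "b", "c"), {"timestamp": 2, "v": "new"}),
    (("x", "y", "z"), {"timestamp": 5, "v": "only"}),
]
deduped = latest_per_key(events)
```

In a real streaming pipeline this per-key state cannot be a plain dict; it requires grouping elements by key across workers, which is exactly the shuffle discussed in the answer below.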

[Pipeline screenshot]

By "slow" I assume you mean how many elements it processes per second?

There are two things going on here. First, I assume that Get latest element per key contains a GroupByKey of sorts. This involves a global shuffle, with elements being sent over the network to other workers, to ensure all elements with a given key are grouped onto the same worker. This IO can be expensive, at least in terms of wall time.
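The routing behind such a shuffle can be sketched in a few lines: each element is sent to a worker chosen by hashing its key, which guarantees that all elements sharing a key land on the same worker. The worker count (4) is arbitrary; this is an illustration of the idea, not Beam's actual shuffle implementation.

```python
from collections import defaultdict

def shuffle_to_workers(elements, num_workers=4):
    # Route each (key, value) element to a worker bucket by key hash,
    # so every element with the same key ends up on the same worker.
    workers = defaultdict(list)
    for key, value in elements:
        workers[hash(key) % num_workers].append((key, value))
    return workers

buckets = shuffle_to_workers([("a", 1), ("a", 2), ("b", 3)])
```

Every element crosses the network to reach its assigned bucket, which is why a GroupByKey accumulates wall time even when the per-element work is trivial.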

Secondly, steps that don't require worker-to-worker communication are "fused," which couples their throughputs. E.g., if one has DoFnA followed by DoFnB followed by DoFnC, processing proceeds by passing the first element through DoFnA, then passing its outputs to DoFnB and subsequently DoFnC, before the second element is passed to DoFnA. This means that if one of the DoFns (or the reading or writing) has bounded throughput, all of them will.
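The element-at-a-time interleaving described above can be sketched as ordinary function composition, where a trace log shows each element traversing all three DoFns before the next element starts. This is a conceptual sketch of fusion, not how a Beam runner is implemented.

```python
# Trace log recording (step, input) pairs in execution order.
order = []

def do_fn_a(x):
    order.append(("A", x))
    return x + 1

def do_fn_b(x):
    order.append(("B", x))
    return x * 2

def do_fn_c(x):
    order.append(("C", x))
    return x

# Fused execution: each element flows through A -> B -> C
# before the next element is read from the source.
for element in [1, 2]:
    do_fn_c(do_fn_b(do_fn_a(element)))
```

Because the stage advances one element at a time through the whole chain, the slowest fn (or the source/sink) bounds the throughput of every step fused with it, which is why fast downstream steps can still show high wall time.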
