
Apache Beam Wall Time keeps increasing

I have a Beam pipeline that reads from a Pub/Sub topic, applies some small transformations, and then writes the events to some BigQuery tables.

The transforms are light on processing, maybe removing a field or something similar, but, as you can see from the image below, the Wall Time is very high for some steps. What could be causing this?

Every element is a tuple of the form ((str, str, str), {**dict with data}). Using this key, we do a naive deduplication by keeping only the latest event per key. Basically, whatever I add after the Get latest element per key step is slow, and the tagging step is also slow, even though it just adds a tag to each element.
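To make the dedup step concrete, here is a minimal, hypothetical sketch of "keep the latest event per key" in plain Python (not the asker's actual Beam code). The field name "timestamp" is an assumption for illustration; in Beam this logic would typically sit behind a GroupByKey or a combiner.

```python
# Hypothetical sketch: elements are ((str, str, str), {...}) tuples;
# for each key we keep the event with the greatest "timestamp" field.
def latest_per_key(elements):
    latest = {}
    for key, event in elements:
        # Keep this event if it's the first seen for the key,
        # or newer than the one we currently hold.
        if key not in latest or event["timestamp"] > latest[key]["timestamp"]:
            latest[key] = event
    return latest

events = [
    (("a", "b", "c"), {"timestamp": 1, "v": "old"}),
    (("a", "b", "c"), {"timestamp": 2, "v": "new"}),
    (("x", "y", "z"), {"timestamp": 5, "v": "only"}),
]
deduped = latest_per_key(events)
```

In a real streaming pipeline this per-key state cannot be a plain dict; it requires grouping elements by key across workers, which is exactly the shuffle discussed in the answer below.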

[Pipeline screenshot]

By "slow" I assume you mean how many elements it processes per second?

There are two things going on here. First, I assume that Get latest element per key contains a GroupByKey of sorts. This involves a global shuffle, with elements being sent over the network to other workers, to ensure all elements with a given key are grouped onto the same worker. This IO can be expensive, at least in terms of wall time.
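The routing behind such a shuffle can be sketched in a few lines: each element is sent to a worker chosen by hashing its key, which guarantees that all elements sharing a key land on the same worker. The worker count (4) is arbitrary; this is an illustration of the idea, not Beam's actual shuffle implementation.

```python
from collections import defaultdict

def shuffle_to_workers(elements, num_workers=4):
    # Route each (key, value) element to a worker bucket by key hash,
    # so every element with the same key ends up on the same worker.
    workers = defaultdict(list)
    for key, value in elements:
        workers[hash(key) % num_workers].append((key, value))
    return workers

buckets = shuffle_to_workers([("a", 1), ("a", 2), ("b", 3)])
```

Every element crosses the network to reach its assigned bucket, which is why a GroupByKey accumulates wall time even when the per-element work is trivial.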

Secondly, steps that don't require worker-to-worker communication are "fused," which couples their throughputs. E.g., if one has DoFnA followed by DoFnB followed by DoFnC, processing proceeds by passing the first element through DoFnA, then passing its outputs to DoFnB and subsequently DoFnC, before the second element is passed to DoFnA. This means that if one of the DoFns (or the reading or writing) has bounded throughput, all of them will.
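The element-at-a-time interleaving described above can be sketched as ordinary function composition, where a trace log shows each element traversing all three DoFns before the next element starts. This is a conceptual sketch of fusion, not how a Beam runner is implemented.

```python
# Trace log recording (step, input) pairs in execution order.
order = []

def do_fn_a(x):
    order.append(("A", x))
    return x + 1

def do_fn_b(x):
    order.append(("B", x))
    return x * 2

def do_fn_c(x):
    order.append(("C", x))
    return x

# Fused execution: each element flows through A -> B -> C
# before the next element is read from the source.
for element in [1, 2]:
    do_fn_c(do_fn_b(do_fn_a(element)))
```

Because the stage advances one element at a time through the whole chain, the slowest fn (or the source/sink) bounds the throughput of every step fused with it, which is why fast downstream steps can still show high wall time.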
