
PubSub to Cloud Storage using DataFlow is slow

I am using a slightly adjusted version of the following example snippet to write PubSub messages to GCS:

    class WriteToGCS(beam.DoFn):
        def __init__(self, output_path, prefix):
            self.output_path = output_path
            self.prefix = prefix

        def process(self, key_value, window=beam.DoFn.WindowParam):
            """Write messages in a batch to Google Cloud Storage."""
            start_date = window.start.to_utc_datetime().date().isoformat()
            start = window.start.to_utc_datetime().isoformat()
            end = window.end.to_utc_datetime().isoformat()
            shard_id, batch = key_value
            filename = f'{self.output_path}/{start_date}/{start}-{end}/{self.prefix}-{start}-{end}-{shard_id:03d}'
            with beam.io.gcsio.GcsIO().open(filename=filename, mode="w") as f:
                for message_body in batch:
                    f.write(f"{message_body},".encode("utf-8"))

However, it is terribly slow. This is how it looks in the graph. Is there a way to speed up this step? The subscription receives 500 elements per second, so 3-10 elements per second is not keeping up.

The pipeline looks like this:

    class JsonWriter(beam.PTransform):
        def __init__(self, window_size, path, prefix, num_shards=20):
            self.window_size = int(window_size)
            self.path = path
            self.prefix = prefix
            self.num_shards = num_shards

        def expand(self, pcoll):
            return (
                pcoll
                | "Group into fixed windows" >> beam.WindowInto(window.FixedWindows(self.window_size, 0))
                | "Decode windowed elements" >> beam.ParDo(Decode())
                | "Group into batches" >> beam.BatchElements()
                | "Add key" >> beam.WithKeys(lambda _: random.randint(0, int(self.num_shards) - 1))
                | "Write to GCS" >> beam.ParDo(WriteToGCS(self.path, self.prefix))
            )
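For reference, here is a minimal sketch of how this transform might be attached to a Pub/Sub source in a streaming pipeline. The subscription path, bucket, window size, and prefix below are placeholder assumptions, not values from the original question; JsonWriter, Decode, and WriteToGCS are the classes from the snippets above.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    # Placeholder values, not taken from the original question.
    SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"
    OUTPUT_PATH = "gs://my-bucket/pubsub-archive"

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "Write JSON to GCS" >> JsonWriter(window_size=60, path=OUTPUT_PATH, prefix="events")
        )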

Solved this with these steps:

        | "Add Path" >> beam.ParDo(AddFilename(self.path, self.prefix, self.num_shards))
        | "Group Paths" >> beam.GroupByKey()
        | "Join and encode" >> beam.ParDo(ToOneString())
        | "Write to GCS" >> beam.ParDo(WriteToGCS())

First I add the path per element, then I group by path (to prevent simultaneous writes to the same file/shard). Then I concatenate the JSON strings of each group (path), newline delimited, and write the whole string to the file. This reduces the calls to GCS tremendously, since I don't write every element to a file one by one, but instead join them first and then write once. Hope it helps someone.
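The bodies of AddFilename, ToOneString, and the rewritten WriteToGCS are not shown in the answer; the following is only a sketch of how they could be implemented to match the description above. The exact filename layout and signatures are assumptions.

    import random
    import apache_beam as beam


    class AddFilename(beam.DoFn):
        """Hypothetical sketch: emit (target_filename, element) so that GroupByKey
        collects everything destined for the same file/shard."""

        def __init__(self, path, prefix, num_shards):
            self.path = path
            self.prefix = prefix
            self.num_shards = num_shards

        def process(self, element, window=beam.DoFn.WindowParam):
            start = window.start.to_utc_datetime().isoformat()
            end = window.end.to_utc_datetime().isoformat()
            shard_id = random.randint(0, int(self.num_shards) - 1)
            # Assumed filename layout; the original post does not show it.
            filename = f"{self.path}/{self.prefix}-{start}-{end}-{shard_id:03d}"
            yield filename, element


    class ToOneString(beam.DoFn):
        """Hypothetical sketch: join all JSON strings of one group into a single
        newline-delimited blob so each file is written in one call."""

        def process(self, keyed_group):
            filename, messages = keyed_group
            yield filename, "\n".join(messages)


    class WriteToGCS(beam.DoFn):
        """Hypothetical sketch: one GCS open/write per (filename, blob) pair."""

        def process(self, keyed_blob):
            filename, blob = keyed_blob
            with beam.io.gcsio.GcsIO().open(filename=filename, mode="w") as f:
                f.write(blob.encode("utf-8"))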
