
PubSub to Cloud Storage using DataFlow is slow

I am using a slightly adjusted version of the following example snippet to write PubSub messages to GCS:

    class WriteToGCS(beam.DoFn):
        def __init__(self, output_path, prefix):
            self.output_path = output_path
            self.prefix = prefix

        def process(self, key_value, window=beam.DoFn.WindowParam):
            """Write messages in a batch to Google Cloud Storage."""
            start_date = window.start.to_utc_datetime().date().isoformat()
            start = window.start.to_utc_datetime().isoformat()
            end = window.end.to_utc_datetime().isoformat()
            shard_id, batch = key_value
            filename = f'{self.output_path}/{start_date}/{start}-{end}/{self.prefix}-{start}-{end}-{shard_id:03d}'
            with beam.io.gcsio.GcsIO().open(filename=filename, mode="w") as f:
                for message_body in batch:
                    f.write(f"{message_body},".encode("utf-8"))

However, it is terribly slow. This is how it looks in the graph. Is there a way to speed up this step? The subscription receives 500 elements per second, so 3-10 elements per second is not keeping up.

The pipeline looks like this:

    class JsonWriter(beam.PTransform):
        def __init__(self, window_size, path, prefix, num_shards=20):
            self.window_size = int(window_size)
            self.path = path
            self.prefix = prefix
            self.num_shards = num_shards

        def expand(self, pcoll):
            return (
                pcoll
                | "Group into fixed windows" >> beam.WindowInto(window.FixedWindows(self.window_size, 0))
                | "Decode windowed elements" >> beam.ParDo(Decode())
                | "Group into batches" >> beam.BatchElements()
                | "Add key" >> beam.WithKeys(lambda _: random.randint(0, int(self.num_shards) - 1))
                | "Write to GCS" >> beam.ParDo(WriteToGCS(self.path, self.prefix))
            )
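For reference, here is a minimal sketch of how this transform might be attached to a Pub/Sub source in a streaming pipeline. The subscription path, bucket, window size, and prefix below are placeholder assumptions, not values from the original question; JsonWriter, Decode, and WriteToGCS are the classes from the snippets above.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    # Placeholder values, not taken from the original question.
    SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"
    OUTPUT_PATH = "gs://my-bucket/pubsub-archive"

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "Write JSON to GCS" >> JsonWriter(window_size=60, path=OUTPUT_PATH, prefix="events")
        )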

Solved this with these steps:

        | "Add Path" >> beam.ParDo(AddFilename(self.path, self.prefix, self.num_shards))
        | "Group Paths" >> beam.GroupByKey()
        | "Join and encode" >> beam.ParDo(ToOneString())
        | "Write to GCS" >> beam.ParDo(WriteToGCS())

First I add the path per element, then I group by path (to prevent simultaneous writes to the same file/shard). Then I concatenate the JSON strings of each group (path), newline delimited, and write the whole string to the file. This reduces the calls to GCS tremendously, since I don't write every element to a file one by one, but instead join them first and then write once. Hope it helps someone.
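The bodies of AddFilename, ToOneString, and the rewritten WriteToGCS are not shown in the answer; the following is only a sketch of how they could be implemented to match the description above. The exact filename layout and signatures are assumptions.

    import random
    import apache_beam as beam


    class AddFilename(beam.DoFn):
        """Hypothetical sketch: emit (target_filename, element) so that GroupByKey
        collects everything destined for the same file/shard."""

        def __init__(self, path, prefix, num_shards):
            self.path = path
            self.prefix = prefix
            self.num_shards = num_shards

        def process(self, element, window=beam.DoFn.WindowParam):
            start = window.start.to_utc_datetime().isoformat()
            end = window.end.to_utc_datetime().isoformat()
            shard_id = random.randint(0, int(self.num_shards) - 1)
            # Assumed filename layout; the original post does not show it.
            filename = f"{self.path}/{self.prefix}-{start}-{end}-{shard_id:03d}"
            yield filename, element


    class ToOneString(beam.DoFn):
        """Hypothetical sketch: join all JSON strings of one group into a single
        newline-delimited blob so each file is written in one call."""

        def process(self, keyed_group):
            filename, messages = keyed_group
            yield filename, "\n".join(messages)


    class WriteToGCS(beam.DoFn):
        """Hypothetical sketch: one GCS open/write per (filename, blob) pair."""

        def process(self, keyed_blob):
            filename, blob = keyed_blob
            with beam.io.gcsio.GcsIO().open(filename=filename, mode="w") as f:
                f.write(blob.encode("utf-8"))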
