
Moving from pubsub->bigquery to pubsub->gcs (avro)->bigquery

Our current data pipeline streams our events "directly" to BigQuery. We have a stream of messages in Pub/Sub, which we first read using Dataflow, enrich, and write into another Pub/Sub topic; we then read that topic using a second Dataflow job and write into BigQuery.

It works fine, but it doesn't support proper error handling - we just drop invalid messages instead of handling them, or at least saving them for later. We are thinking of enhancing the process: keep invalid messages aside and allow a simple fix for them later on.

My first approach was writing those problematic messages into a different Pub/Sub topic (something like the sketch below) and handling them from there, but a few people suggested saving them into GCS (maybe as Avro files) instead.
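A minimal sketch of that first approach, using a multi-output ParDo in the Beam Java SDK; it assumes messages is a PCollection<String> read from Pub/Sub, and the topic names and the enrich() helper are hypothetical:

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Tags for the main (valid) and side (invalid) outputs.
final TupleTag<String> validTag = new TupleTag<String>() {};
final TupleTag<String> invalidTag = new TupleTag<String>() {};

PCollectionTuple results = messages.apply("Enrich",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          c.output(enrich(c.element()));      // enrich() stands in for our enrichment logic
        } catch (Exception e) {
          c.output(invalidTag, c.element());  // route failures aside instead of dropping them
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(invalidTag)));

// Valid messages continue down the existing pipeline; invalid ones
// go to a dead-letter topic (or GCS) for later inspection and replay.
results.get(validTag)
    .apply(PubsubIO.writeStrings().to("projects/my-project/topics/enriched"));
results.get(invalidTag)
    .apply(PubsubIO.writeStrings().to("projects/my-project/topics/dead-letter"));
```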
The question is: if we use GCS and Avro, why not do it for all messages? Instead of enriching and writing to Pub/Sub, why not enrich and write to GCS?

If we do that, we could use AvroIO() with watchForNewFiles(), and it seems straightforward.
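A minimal sketch of what I have in mind, assuming the enriched events are written as Avro files under a hypothetical gs://my-bucket/enriched/ prefix (pipeline and schema are assumed to exist):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Continuously match new Avro files and stream their records into the pipeline.
PCollection<GenericRecord> records = pipeline.apply(
    AvroIO.readGenericRecords(schema)        // Avro schema of the enriched events
        .from("gs://my-bucket/enriched/*.avro")
        .watchForNewFiles(
            Duration.standardSeconds(30),    // poll the bucket every 30 seconds
            Watch.Growth.never()));          // never stop watching (unbounded source)
```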
But this sounds too simple, and too good. Before jumping into coding, I am concerned about a few things:

  • I know that using windows in Dataflow effectively turns the stream into batches, but windowing is much more flexible than checking for new files every X minutes. How would I handle, for example, late data?
  • The job runs endlessly and the Avro files will pile up in one bucket - is watchForNewFiles() supposed to work flawlessly as-is? Would it match on file timestamps? Naming format? By keeping a "list" of known old files? Reading the FileIO code, the matching seems quite naive, which means the bigger the bucket grows, the longer each match will take.

Am I missing anything? Isn't this solution less suited to endless streaming than Pub/Sub?

  • There is a set of APIs that controls how late data is handled - see the sketch after this list.
  • I guess it would be problematic if you poll a single, infinitely growing GCS bucket with watchForNewFiles(). I couldn't find official documentation on the scalability of the list API call, but it's reasonable to assume it has O(n) complexity. If you want to use your pipeline in a production environment and have a GCP support subscription, I would recommend talking to GCP support about the scalability of polling a large GCS bucket.
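For reference, a minimal sketch of those late-data APIs in the Beam Java SDK, assuming a PCollection<String> named events (the window size and lateness bound are illustrative):

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Fixed 5-minute windows; elements arriving up to one hour late are
// still emitted in late panes instead of being dropped.
PCollection<String> windowed = events.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardHours(1))
        .accumulatingFiredPanes());
```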
