Moving from pubsub->bigquery to pubsub->gcs (avro)->bigquery
Our current data pipeline streams our events "directly" to bigquery.
We have a stream of messages in pubsub, which we first read using dataflow, enrich, and write into another pubsub topic; we then read that using a second dataflow job and write into bigquery.
It works fine, but it doesn't support proper error handling - we just drop invalid messages instead of handling them, or at least saving them for later.
We are thinking of enhancing the process: keep invalid messages aside and allow a simple fix of them later on.
My first approach was writing those problematic messages into a different pubsub topic and handling them from there, but a few people suggested saving them into GCS (maybe as AVRO files) instead.
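The dead-letter idea itself is independent of where the bad messages end up: partition each incoming message into a "valid" or "invalid" output instead of dropping failures. Below is a minimal, runner-agnostic Python sketch of that routing step; the `is_valid` check is a hypothetical stand-in, not our actual validation logic:

```python
import json

def is_valid(raw: bytes) -> bool:
    # Hypothetical validity check: the message must parse as JSON
    # and carry an "event_type" field.
    try:
        msg = json.loads(raw)
        return "event_type" in msg
    except (ValueError, UnicodeDecodeError):
        return False

def route(messages):
    """Split a batch into (valid, dead_letter) instead of silently dropping."""
    valid, dead_letter = [], []
    for raw in messages:
        (valid if is_valid(raw) else dead_letter).append(raw)
    return valid, dead_letter

batch = [b'{"event_type": "click"}', b'not json', b'{"no_type": 1}']
ok, bad = route(batch)
```

In Beam terms this corresponds to a transform with a main output and a dead-letter side output; whether `bad` is then published to a second pubsub topic or written to GCS is the choice discussed above.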
The question is: if we use GCS and AVRO, why not do it for all messages? Instead of enriching and writing to pubsub, why not enrich and write to GCS?
If we do that, we could use AvroIO() with watchForNewFiles(), and it seems straightforward.
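To make the concern concrete: watchForNewFiles() essentially boils down to repeatedly matching a filepattern and emitting only the files it hasn't seen before. The toy Python sketch below imitates that polling loop with a local directory standing in for a GCS bucket (the real Beam implementation lives in FileIO/Watch and is considerably more involved):

```python
import os
import tempfile

def poll_new_files(directory: str, seen: set) -> list:
    """One polling round: list everything, keep only unseen files.

    Note that the full listing happens on every round, regardless of
    how few files are actually new."""
    listed = sorted(os.listdir(directory))  # cost grows with total file count
    new = [name for name in listed if name not in seen]
    seen.update(new)
    return new

tmp = tempfile.mkdtemp()
seen = set()

open(os.path.join(tmp, "a.avro"), "w").close()
first = poll_new_files(tmp, seen)   # picks up a.avro

open(os.path.join(tmp, "b.avro"), "w").close()
second = poll_new_files(tmp, seen)  # picks up only the new file
```

The "seen" bookkeeping is what keeps each round's output incremental, but the listing itself is repeated in full, which is exactly the scaling worry raised below.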
But this sounds too simple, and too good. Before jumping into coding, I am concerned about a few things:
Is watchForNewFiles() supposed to work flawlessly as-is? Would it be based on file timestamps? Looking at the FileIO code, the matching method seems quite naive, which means the bigger the bucket grows, the longer the match will take. Am I missing anything? Isn't this solution a worse fit for endless streaming than pubsub?
Polling a single, infinitely growing GCS bucket with watchForNewFiles() would be problematic. I couldn't find an official document mentioning the scalability of the list API call, but it's reasonable to think it has O(n) complexity. If you want to use your pipeline in a production environment and have a GCP support subscription, I would recommend talking to GCP support about the scalability of polling a large GCS bucket.
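The O(n) intuition is easy to demonstrate: if each polling round naively lists the whole bucket, then even when only one file is new per round, total work over k rounds grows quadratically. A self-contained toy sketch, with a Python list standing in for the bucket's object listing:

```python
def poll_with_cost(bucket: list, seen: set):
    """Return (new_files, objects_scanned) for one naive listing pass."""
    scanned = 0
    new = []
    for name in bucket:  # naive list call: walks the whole bucket
        scanned += 1
        if name not in seen:
            new.append(name)
            seen.add(name)
    return new, scanned

bucket, seen = [], set()
total_scanned = 0
for i in range(1000):  # one new object arrives per polling round
    bucket.append(f"events-{i:04d}.avro")
    new, scanned = poll_with_cost(bucket, seen)
    total_scanned += scanned

# Each round scans the whole bucket, so the cumulative work is
# 1 + 2 + ... + 1000 objects, even though every round yields
# exactly one new file.
```

In practice GCS listing is paginated rather than a single call, but the cumulative cost per poll still grows with the number of objects matched, which is why partitioning output paths (e.g. by date) or periodically moving processed files out of the watched prefix is worth considering.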