
Google Cloud Dataflow - From PubSub to Parquet

I'm trying to write Google PubSub messages to Google Cloud Storage using Google Cloud Dataflow. The PubSub messages arrive in JSON format, and the only operation I want to perform is a conversion from JSON to Parquet files.

In the official documentation I found a template provided by Google that reads data from a Pub/Sub topic and writes Avro files into the specified Cloud Storage bucket ( https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#pubsub-to-cloud-storage-avro ). The problem is that the template source code is written in Java, while I would prefer to use the Python SDK.

These are my first tests with Dataflow and Beam in general, and there is not much material online to draw on. Any suggestions, links, guidance, or code snippets would be greatly appreciated.

In order to further contribute to the community, I am summarising our discussion as an answer.

Since you are just starting with Dataflow, I can point out some useful topics and advice:

  1. The built-in PTransform WriteToParquet() in Apache Beam is very useful. It writes a PCollection of records to Parquet files. To use it, you need to specify the schema, as indicated in the documentation. In addition, this article will help you better understand how to use this transform and how to write its output to a Google Cloud Storage (GCS) bucket.

  2. Google provides this code showing how to read messages from PubSub and write them to Google Cloud Storage. This QuickStart reads messages from PubSub and writes the messages from each window to a bucket.

  3. Since you want to read from PubSub, convert the messages to Parquet, and store the files in a GCS bucket, I would advise structuring your pipeline as these steps: read your messages, write them to a Parquet file, and store it in GCS.
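To make the three steps above more concrete, here is a minimal sketch in the Python SDK. It is only an illustration under stated assumptions, not the code from the linked QuickStart or article: the message fields (`user_id`, `event`, `value`), the 60-second fixed window, and the `subscription` and `output_prefix` arguments are all hypothetical and should be replaced with your own. Also note that whether `WriteToParquet` accepts an unbounded, windowed PCollection depends on the Beam release you run, so please verify that against your version; if it does not, `apache_beam.io.fileio.WriteToFiles` with a custom sink may be worth a look.

```python
import json

# Hypothetical message layout: {"user_id": "...", "event": "...", "value": 1.5}
FIELD_NAMES = ("user_id", "event", "value")


def parse_message(data: bytes) -> dict:
    """Decode one Pub/Sub payload (UTF-8 JSON) into a dict with the schema's keys."""
    record = json.loads(data.decode("utf-8"))
    # Keep only the known fields; missing ones become None (null in Parquet).
    return {name: record.get(name) for name in FIELD_NAMES}


def run(subscription: str, output_prefix: str, pipeline_args=None) -> None:
    """Build and run the streaming pipeline: Pub/Sub -> JSON parse -> Parquet."""
    # Imported inside the function so that parse_message can be tried out
    # on a machine that has neither Beam nor pyarrow installed.
    import apache_beam as beam
    import pyarrow as pa
    from apache_beam.io.parquetio import WriteToParquet
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    # WriteToParquet needs an explicit pyarrow schema (assumed fields, see above).
    schema = pa.schema([
        ("user_id", pa.string()),
        ("event", pa.string()),
        ("value", pa.float64()),
    ])

    options = PipelineOptions(pipeline_args, streaming=True, save_main_session=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Without with_attributes, each element is the raw payload as bytes.
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=subscription)
            | "ParseJson" >> beam.Map(parse_message)
            # Assumed window size: group the stream into 60-second windows.
            | "Window" >> beam.WindowInto(FixedWindows(60))
            | "WriteParquet" >> WriteToParquet(
                output_prefix, schema, file_name_suffix=".parquet")
        )
```

You would call `run()` with a subscription path of the form `projects/<project>/subscriptions/<name>` and an output prefix such as a `gs://` path in your bucket, passing the usual Dataflow flags (runner, project, region, temp location) through `pipeline_args`.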

I encourage you to read the links above. If you then have any other questions, you can post a new thread to get more specific help.

