
Google Cloud Dataflow ETL (Datastore -> Transform -> BigQuery)

We have an application running on Google App Engine, using Datastore as the persistence back-end. Currently the application has mostly 'OLTP' features and some rudimentary reporting. While implementing reports, we found that processing large amounts of data (millions of objects) is very difficult using Datastore and GQL. To enhance our application with proper reporting and Business Intelligence features, we think it's better to set up an ETL process to move data from Datastore to BigQuery.

Initially we thought of implementing the ETL process as an App Engine cron job, but it looks like Dataflow can also be used for this. We have the following requirements for setting up the process:

  • Be able to push all existing data to BigQuery using BigQuery's non-streaming (batch load) API.
  • Once the above is done, push any new data to BigQuery using the streaming API whenever it is created or updated in Datastore.

My questions are:

  1. Is Cloud Dataflow the right candidate for implementing this pipeline?
  2. Will we be able to push the existing data? Some of the Kinds have millions of objects.
  3. What would be the right approach to implement it? We are considering two approaches. The first approach is to go through Pub/Sub: for existing data, create a cron job that pushes all of it to Pub/Sub, and for any new updates, push the data to Pub/Sub at the same time it is updated in Datastore. A Dataflow pipeline then picks it up from Pub/Sub and pushes it to BigQuery (a minimal sketch of this streaming path is shown after this list). The second approach is to create a batch pipeline in Dataflow that queries Datastore and pushes any new data to BigQuery.
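For illustration, here is a minimal sketch of the streaming path from the first approach, written with the Apache Beam Java SDK (which Dataflow runs). The topic name `projects/my-project/topics/datastore-updates`, the field names `id` and `amount`, and the destination table `my-project:reporting.orders` are hypothetical placeholders; it assumes the application publishes the changed entity's fields as Pub/Sub message attributes whenever it writes to Datastore.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

import java.util.Arrays;

public class PubSubToBigQueryStreaming {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    // Hypothetical schema matching the fields the app publishes.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("STRING"),
        new TableFieldSchema().setName("amount").setType("FLOAT")));

    p.apply("ReadFromPubSub",
            // The app publishes one message per created/updated entity,
            // with the entity fields carried as message attributes.
            PubsubIO.readMessagesWithAttributes()
                .fromTopic("projects/my-project/topics/datastore-updates"))
        .apply("MessageToTableRow", ParDo.of(new DoFn<PubsubMessage, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            PubsubMessage msg = c.element();
            // Assumes both attributes are always present and well-formed.
            c.output(new TableRow()
                .set("id", msg.getAttribute("id"))
                .set("amount", Double.parseDouble(msg.getAttribute("amount"))));
          }
        }))
        .setCoder(TableRowJsonCoder.of())
        .apply("StreamToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:reporting.orders")
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

Because the Pub/Sub source is unbounded, BigQueryIO should default to streaming inserts here, which matches the second requirement above.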

The question is: are these two approaches doable? Which one is better cost-wise? Is there any other way that is better than the above two?

Thank you,

rizTaak

Dataflow can absolutely be used for this purpose. In fact, Dataflow's scalability should make the process fast and relatively easy.

Both of your approaches should work -- I'd give a preference to the second one: use a batch pipeline to move the existing data, and then a streaming pipeline to handle new data via Cloud Pub/Sub. In addition to the data movement, Dataflow allows arbitrary analytics/manipulation to be performed on the data itself.
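To make that concrete, here is a minimal sketch of such a batch pipeline with the Apache Beam Java SDK, reading one Kind from Datastore and loading it into BigQuery. The kind name `Order`, the field names, the project id `my-project`, and the table `reporting.orders` are hypothetical; a real pipeline would map each entity's properties to your own schema.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Query;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

import java.util.Arrays;

public class DatastoreToBigQueryBatch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Query every entity of the (hypothetical) kind "Order".
    Query.Builder query = Query.newBuilder();
    query.addKindBuilder().setName("Order");

    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("STRING"),
        new TableFieldSchema().setName("amount").setType("FLOAT")));

    p.apply("ReadFromDatastore",
            DatastoreIO.v1().read()
                .withProjectId("my-project")
                .withQuery(query.build()))
        .apply("EntityToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((Entity e) -> new TableRow()
                    // Assumes a key whose last path element has a string name.
                    .set("id", e.getKey().getPath(e.getKey().getPathCount() - 1).getName())
                    .set("amount", e.getPropertiesOrThrow("amount").getDoubleValue())))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:reporting.orders")
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run();
  }
}
```

Run it with the Dataflow runner (e.g. `--runner=DataflowRunner --project=my-project`, values again hypothetical); since the Datastore source is bounded, BigQueryIO should use batch load jobs rather than streaming inserts, which matches the first requirement.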

That said, BigQuery and Datastore can be connected directly. See, for example, Loading Data From Cloud Datastore in the BigQuery documentation.
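For completeness, a hedged sketch of that direct route with the google-cloud-bigquery Java client, assuming the Kind has already been exported from Datastore to Cloud Storage: start a load job with the DATASTORE_BACKUP source format pointing at the export metadata file. The bucket path and table names below are hypothetical.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadDatastoreExport {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hypothetical GCS path of a Datastore export of the "Order" kind.
    String sourceUri =
        "gs://my-bucket/exports/default_namespace/kind_Order/"
            + "default_namespace_kind_Order.export_metadata";

    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(TableId.of("reporting", "orders"), sourceUri)
            .setFormatOptions(FormatOptions.datastoreBackup())
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .build();

    Job job = bigquery.create(JobInfo.of(config));
    job = job.waitFor();  // block until the load job finishes
    System.out.println("Load job state: " + job.getStatus().getState());
  }
}
```

This avoids writing a pipeline at all for the one-time backfill, but it does not cover the streaming part of the requirements.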

Another way is to use a 3rd-party solution for loading data into Google BigQuery. There are plenty of them here. Most of them are paid, but there are free ones with limited data-loading frequency. In this case you won't need to code anything.
