BigQuery to Hadoop Cluster - How to transfer data?
I have a Google Analytics (GA) account which tracks the user activity of an app. I have BigQuery set up so that I can access the raw GA data. Data flows from GA into BigQuery on a daily basis.
I have a Python app which queries the BigQuery API programmatically. This app gives me the required response, depending on what I query for.
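For context, a querying app like this can be a thin wrapper around the BigQuery client library. The sketch below is a minimal, hypothetical version (project, dataset, and column names are placeholders, not from the question); GA's BigQuery export writes one table per day named `ga_sessions_YYYYMMDD`, which is what the helper builds:

```python
from datetime import date

# Hypothetical project/dataset names -- replace with your own GA export dataset.
PROJECT = "my-project"
DATASET = "my_ga_dataset"

def daily_table(day: date) -> str:
    """GA's BigQuery export creates one table per day: ga_sessions_YYYYMMDD."""
    return f"{PROJECT}.{DATASET}.ga_sessions_{day.strftime('%Y%m%d')}"

def run_query(day: date):
    """Run a query against that day's GA export table (requires credentials)."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client(project=PROJECT)
    sql = f"""
        SELECT fullVisitorId, visitStartTime, totals.pageviews
        FROM `{daily_table(day)}`
    """
    return list(client.query(sql).result())

if __name__ == "__main__":
    print(daily_table(date(2015, 1, 1)))
```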
My next step is to get this data from BigQuery and dump it into a Hadoop cluster. Ideally, I would like to create a Hive table from the data. I would like to build something like an ETL process around the Python app. For example, on a daily basis the ETL process runs the Python app and also exports the data to the cluster.
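One simple shape for that daily step, sketched below under stated assumptions: the Python app returns rows as dicts, the data lands in HDFS as one directory per day, and a date-partitioned external Hive table already exists. All paths and table names are hypothetical:

```python
import json
import subprocess
from datetime import date

# Hypothetical paths and table names -- adjust for your cluster.
LOCAL_DIR = "/tmp/ga_export"
HDFS_DIR = "/data/ga"

def hdfs_path(day: date) -> str:
    """One HDFS directory per day keeps Hive partitioning simple."""
    return f"{HDFS_DIR}/dt={day.isoformat()}"

def run_etl(day: date, rows):
    """Dump rows (dicts from the BigQuery app) to JSON, push to HDFS, register in Hive."""
    local_file = f"{LOCAL_DIR}/ga_{day:%Y%m%d}.json"
    with open(local_file, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    target = hdfs_path(day)
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", target])
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_file, target])

    # Register the new partition (ga_sessions assumed to exist as an
    # external, JSON-backed Hive table partitioned by dt).
    ddl = (f"ALTER TABLE ga_sessions ADD IF NOT EXISTS "
           f"PARTITION (dt='{day.isoformat()}') LOCATION '{target}'")
    subprocess.check_call(["hive", "-e", ddl])

if __name__ == "__main__":
    print(hdfs_path(date(2015, 1, 1)))
```

A driver like this is also easy to invoke from a scheduler (cron, Jenkins, or Oozie) with the date as its only argument, which keeps re-runs for a past day straightforward.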
Eventually, this ETL process should be put on Jenkins and should be able to run on production systems.
What architecture/design/general factors should I consider while planning this ETL process?
Any suggestions on how I should go about this? I am interested in doing this in the simplest, most viable way.
Thanks in advance.
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop:

https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class. (It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
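If you would rather drive that same GCS-staging pattern yourself from the Python app instead of through the connector, a hedged sketch with the BigQuery client might look like this (bucket and table names are placeholders; the extract job requires credentials and network access):

```python
from datetime import date

# Hypothetical staging bucket name.
BUCKET = "my-staging-bucket"

def staging_uri(table_name: str) -> str:
    """GCS path pattern where BigQuery will write the exported shards."""
    return f"gs://{BUCKET}/bq_export/{table_name}/part-*.json"

def export_table(project: str, dataset: str, table_name: str):
    """Kick off a BigQuery extract job to GCS (requires credentials)."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client(project=project)
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    )
    table_ref = f"{project}.{dataset}.{table_name}"
    # Blocks until the export job finishes; shards appear under staging_uri().
    client.extract_table(table_ref, staging_uri(table_name),
                         job_config=job_config).result()

if __name__ == "__main__":
    print(staging_uri("ga_sessions_20150101"))
```

From there the exported files can be copied into HDFS (or read directly via the GCS connector) and exposed as a Hive external table.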
Check out Oozie. It seems to fit your requirements: it has a workflow engine, scheduling support, and support for shell scripts and Hive.
In terms of installation and deployment, it is usually part of a Hadoop distribution, but it can be installed separately. It depends on a database as its persistence layer, which may require some extra effort.
It has a web UI and a REST API, so managing and monitoring jobs can be automated if desired.
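To make that concrete, a minimal Oozie workflow for this use case could chain a shell action (running the Python BigQuery app) into a Hive action (loading the new data). This is only a sketch; the script names, paths, and property values are hypothetical:

```xml
<workflow-app name="ga-etl" xmlns="uri:oozie:workflow:0.5">
    <start to="run-python-app"/>

    <!-- Shell action: run the Python app that queries BigQuery -->
    <action name="run-python-app">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run_bq_export.sh</exec>
            <file>scripts/run_bq_export.sh</file>
        </shell>
        <ok to="load-hive"/>
        <error to="fail"/>
    </action>

    <!-- Hive action: register the newly exported data as a partition -->
    <action name="load-hive">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_partition.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>GA ETL failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

An Oozie coordinator (or Jenkins, as you mention) can then trigger this workflow daily with the date passed in as a parameter.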