
Exporting BigQuery data for analysis using python

I am new to Google BigQuery, so I'm trying to understand how best to accomplish my use case.

I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms I have written in Python. Since there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run sequentially: each script modifies some columns of the data, and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.

Some approaches I had in mind are:

  1. Export the BigQuery table into a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using the sqlite3 Python package. Once all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.

  2. Query data from BigQuery every time I want to run a script, using the google-cloud Python client library or the pandas-gbq package. Modify the BigQuery table after running each script.
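A minimal local sketch of approach 1, using only the standard-library sqlite3 module (the table, columns, and values here are hypothetical placeholders for the real visit data):

```python
import os
import sqlite3

db_path = "daily_visits.db"  # the temporary db file on the GAE instance
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE visits (customer_id TEXT, visit_date TEXT, score REAL)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)",
                 [("a", "2020-01-01", 1.0), ("b", "2020-01-01", 2.0)])

# Each script reads only the subset it needs for that day...
subset = conn.execute(
    "SELECT customer_id, score FROM visits WHERE visit_date = ? ORDER BY customer_id",
    ("2020-01-01",),
).fetchall()

# ...and writes its modified columns back for the next script in the sequence.
conn.execute("UPDATE visits SET score = score * 2 WHERE visit_date = ?", ("2020-01-01",))
conn.commit()
final = conn.execute(
    "SELECT customer_id, score FROM visits ORDER BY customer_id"
).fetchall()
conn.close()

# After the last script, the table would be exported back to BigQuery
# and the local db file removed.
os.remove(db_path)
```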
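Approach 2 might look like the following sketch. Only the query construction runs here; the actual client calls (in comments) assume the google-cloud-bigquery or pandas-gbq libraries with valid credentials, and the project, dataset, table, and column names are all hypothetical:

```python
def build_daily_query(table: str, day: str) -> str:
    """Build the per-script query for one day's subset of visits (names are hypothetical)."""
    return (
        "SELECT customer_id, visit_count "
        f"FROM `{table}` "
        f"WHERE visit_date = '{day}'"
    )

sql = build_daily_query("my_project.my_dataset.daily_visits", "2020-01-01")

# With credentials configured, each script would then run something like:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   df = client.query(sql).to_dataframe()  # pandas DataFrame for the algorithm
# or, with pandas-gbq:
#   import pandas_gbq
#   df = pandas_gbq.read_gbq(sql, project_id="my_project")
```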

Could somebody tell me which of these would be the better way to accomplish this (in terms of efficiency/cost), or suggest alternatives?

Thanks!

The answer to your question mostly depends on your use case and the size of the data you will be processing, so there is no single correct answer.

However, there are some points you may want to take into account regarding BigQuery and how some of its features could be useful in the scenario you described.

Let me quickly go over the main topics you should have a look at:

  • Pricing: leaving aside the billing of storage and focusing on the cost of the queries themselves (which is more relevant to your use case), BigQuery billing is based on the number of bytes processed by each query. There is a 1 TB free quota per month, and beyond that the cost is $5 per TB of processed data, with a minimum measurable unit of 10 MB per query.
  • Cache: when BigQuery returns results, they are stored in a temporary cached table (or a permanent one if you wish), which is maintained for approximately 24 hours, with some exceptions you can find in that same documentation link (caching is also best-effort, so earlier deletion may happen). Results returned from a cached table are not billed, as long as you run the exact same query: per the billing definition, cost is based on bytes processed, and reading a cached table involves no processing. This feature is worth a look, because from your sentence "Since there are multiple scripts that use subsets of the daily data", it may apply to your use case (just guessing here) to run a single query once and then retrieve the results multiple times from the cached version, without having to store them anywhere else.
  • Partitions: BigQuery offers partitioned tables, which are individual tables divided into smaller segments by date, making it easier to query data on a daily basis as you require.
  • Speed: BigQuery is a real-time analytics platform, so you can run fast queries that retrieve the information you need and apply some initial processing that your custom Python algorithms can then build on.
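To illustrate the cache point, the key is reusing the exact same query string across scripts. The following is a purely local analogue (no BigQuery calls) of that fetch-once-reuse-many-times pattern, with a stand-in fetch function and dummy result data:

```python
from functools import lru_cache

calls = 0  # counts how many times the (expensive, billed) query actually runs

@lru_cache(maxsize=None)
def fetch(sql: str):
    """Stand-in for client.query(sql).to_dataframe(); runs only on a cache miss."""
    global calls
    calls += 1
    return (("a", 1.0), ("b", 2.0))  # dummy result set

sql = ("SELECT customer_id, score FROM `project.dataset.visits` "
       "WHERE visit_date = '2020-01-01'")
first = fetch(sql)   # processed (billed) once
second = fetch(sql)  # identical string: served from cache, not billed again
```

On the BigQuery side itself, caching is on by default; in the Python client it can be controlled explicitly via `bigquery.QueryJobConfig(use_query_cache=True)`.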
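For the partitioning point, a date-partitioned table for the daily visits could be declared with standard-SQL DDL along these lines (the dataset, table, and column names are hypothetical):

```python
# BigQuery standard-SQL DDL for a table partitioned by the visit date.
# Each day's data then lives in its own partition, and queries that filter
# on visit_date scan (and bill) only the partitions they touch.
ddl = """
CREATE TABLE my_dataset.daily_visits (
  customer_id STRING,
  visit_date  DATE,
  visit_count INT64
)
PARTITION BY visit_date
"""
```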

So, in general, I would say there is no need to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data without huge expenses or delays in data retrieval. Again, though, this will ultimately depend on your use case and the amount of data you are storing and need to process simultaneously; in general terms, I would just go with BigQuery on its own.
