Exporting BigQuery data for analysis using Python
I am new to Google BigQuery, so I'm trying to understand how best to accomplish my use case.
I have daily customer-visit data stored in BigQuery that I wish to analyse using some algorithms I have written in Python. Since multiple scripts use subsets of the daily data, I was wondering what the best way to fetch and temporarily store the data would be. Additionally, the scripts run sequentially: each script modifies some columns of the data, and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back in BigQuery.
Some approaches I had in mind are:
1. Export the BigQuery table to a GAE (Google App Engine) instance as a db file, and query the relevant data for each script from the db file using the sqlite3 Python package. Once all the scripts have run, store the modified table back in BigQuery and then remove the db file from the GAE instance.
2. Query data from BigQuery every time I want to run a script, using the google-cloud Python client library or the pandas-gbq package. Modify the BigQuery table after running each script.
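For illustration, the first approach might look roughly like the sketch below, which uses only Python's standard sqlite3 module. The table name visits, its columns (customer_id, visit_date, duration_min, score) and the two step functions are hypothetical stand-ins for the real export and the real scripts. Getting the rows out of BigQuery and loading them back is deliberately left out.

```python
import sqlite3


def load_visits(conn, rows):
    """Create a scratch table and fill it with rows exported from BigQuery.

    The schema is a made-up example; score starts out as NULL.
    """
    conn.execute(
        "CREATE TABLE visits ("
        "customer_id TEXT, visit_date TEXT, duration_min REAL, score REAL)"
    )
    conn.executemany(
        "INSERT INTO visits (customer_id, visit_date, duration_min) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def step_cap_duration(conn):
    """Hypothetical first script: cap visit durations at 120 minutes."""
    conn.execute("UPDATE visits SET duration_min = 120.0 WHERE duration_min > 120.0")
    conn.commit()


def step_score(conn):
    """Hypothetical second script: relies on the column modified by the first."""
    conn.execute("UPDATE visits SET score = duration_min * 0.5")
    conn.commit()


def run_all(conn, steps):
    """Run the scripts one after another against the same scratch database."""
    for step in steps:
        step(conn)


def final_rows(conn):
    """Read back the fully modified table, ready to be loaded into BigQuery."""
    cursor = conn.execute(
        "SELECT customer_id, visit_date, duration_min, score "
        "FROM visits ORDER BY customer_id"
    )
    return cursor.fetchall()
```

Passing ":memory:" to sqlite3.connect keeps the scratch database in RAM, so there is no db file to clean up afterwards; a file path works the same way if the data does not fit in memory.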
Does anybody know which of these would be the better way to accomplish this (in terms of efficiency/cost), or can you suggest alternatives?
Thanks!
The answer to your question mostly depends on your use case and the size of the data you will be processing, so there is no single correct answer.
However, there are some points you may want to take into account regarding how you use BigQuery and how some of its features can be useful in the scenario you described.
Let me quickly go over the main topics you should have a look at:
So, in general, I would say there is no need for you to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data locally without incurring huge expenses or delays in data retrieval. However, again, this will ultimately depend on your use case and the amount of data you are storing and need to process simultaneously; in general terms, though, I would just go with BigQuery on its own.
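As a rough sketch of that BigQuery-only route, assuming the google-cloud-bigquery client library (with pandas and pyarrow installed) and default credentials: fetch only the columns and the day the scripts need, pass the result through the scripts in memory, and write it back once at the end. The column name visit_date, the helper names and the idea of writing to a separate result table are assumptions made for illustration, not something stated in the question.

```python
def build_subset_query(table_id, columns):
    """Build a SELECT for just the needed columns, filtered by a @day parameter.

    On-demand BigQuery queries are billed by bytes processed, so selecting
    only the required columns (rather than SELECT *) helps keep cost down.
    """
    column_list = ", ".join(f"`{name}`" for name in columns)
    return f"SELECT {column_list} FROM `{table_id}` WHERE visit_date = @day"


def fetch_day(table_id, columns, day):
    """Run the query and return a pandas DataFrame (day is a datetime.date)."""
    from google.cloud import bigquery  # lazy import: needs google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("day", "DATE", day)]
    )
    sql = build_subset_query(table_id, columns)
    return client.query(sql, job_config=job_config).result().to_dataframe()


def run_pipeline(data, steps):
    """Apply each script in order; every step receives the previous step's output."""
    for step in steps:
        data = step(data)
    return data


def store_result(df, result_table_id):
    """Write the final DataFrame to a result table, replacing its contents.

    Assumption: result_table_id is a separate table. WRITE_TRUNCATE overwrites
    the destination, so pointing it at the source table would drop other days.
    """
    from google.cloud import bigquery  # lazy import, as above

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
    client.load_table_from_dataframe(df, result_table_id, job_config=job_config).result()
```

With this layout BigQuery is read once and written once per run, and the intermediate results never leave the Python process. Whether that is feasible depends on the daily subset fitting in memory.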