简体   繁体   中英

Exporting BigQuery data for analysis using python

I am new to Google BigQuery so I'm trying to understand how to best accomplish my use case.

I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms that I have written in python. Since, there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run in a sequential manner. Each script modifies some columns of the data and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.

Some approaches I had in mind are:

  1. Export the bigquery table into a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using sqlite3 python package . Once, all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.

  2. Query data from BigQuery every time I want to run a script using the google-cloud python client library or pandas gbq package . Modify the BigQuery table after running each script.

Could somebody know which of these would be a better way to accomplish this (in terms of efficiency/cost) or suggest alternatives?

Thanks!

The answer to your question mostly depends on your use case and the size of the data that you will be processing, so there is not an absolute and correct answer for it.

However, there are some points that you may want to take into account regarding the usage of BigQuery and how some of its features can be interesting for you in the scenario you described.

Let me quickly go over the main topics you should have a look at:

  • Pricing: leaving aside the billing of storage, and focusing in the cost of queries themselves (which is more related to your use case), BigQuery billing is based on the number of bytes processed on each query. There is a 1TB free quota per month, and from then on, the cost is of $5 per TB of processed data, being the minimum measurable unit 10MB of data.
  • Cache: when BigQuery returns some information, it is stored in a temporary cached table (or a permanent one if you wish), and they are maintained for approximately 24 hours with some exceptions that you may find in this same documentation link (they are also best-effort, so earlier deletion may happen too). Results returned from a cached table are not billed (because as per the definition of the billing, the cost is based on the number of bytes processed, and accessing a cached table implies that there is no processing being done), as long as you are running the exact same query. I think it would be worth having a look at this feature, because from your sentence "Since there are multiple scripts that use subsets of the daily data", maybe (but just guessing here) it applies to your use case to perform a single query once and then retrieve the results multiple times from a cached version without having to store it anywhere else.
  • Partitions: BigQuery offers the concept of partitioned tables , which are individual tables that are partitioned into smaller segments by date, what will make it easier to query data daily as you require.
  • Speed: BigQuery offers a real-time analytics platform, so you will be able to perform fast queries retrieving the information you need, applying some initial processing that you can later use in your custom Python algorithms.

So, in general, I would say that there is no need for you to keep any other database with partial results a part from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data locally without having to deal with huge expenses or delays in data retrieving. However, again, this will finally depend on your use case and the amount of data you are storing and need to process simultaneously; but in general terms, I would just go with BigQuery on its own.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM