
Takes too long to export data from BigQuery into Jupyter notebook

In a Jupyter notebook, I am trying to import data from BigQuery by running a SQL query against the BigQuery server and then storing the result in a dataframe:

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="credentials.json"
from google.cloud import bigquery

sql = """
SELECT * FROM dataset.table
"""
client = bigquery.Client()
df_bq = client.query(sql).to_dataframe()

The data has the shape (6000000, 8) and uses about 350MB of memory once stored in the dataframe.

The query sql, if executed directly in BQ, takes about 2 seconds.

However, it usually takes about 30-40 minutes to execute the code above, and more often than not it fails, raising the following error:

ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))

All in all, there could be three reasons for the error:

  1. It takes the BigQuery server a long time to execute the query
  2. It takes a long time to transfer the data (I don't understand why a 350MB result should take 30 minutes to be sent over the network. I tried using a LAN connection to rule out connection drops and maximize throughput, which didn't help)
  3. It takes a long time to set a dataframe with the data from BigQuery

Would be happy to gain any insight into the problem, thanks in advance!

Use the BigQuery Storage API to get large query results from BigQuery into a pandas dataframe really fast.

Working code snippet:

import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage

# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Make clients.
bqclient = bigquery.Client(credentials=credentials, project=your_project_id)
bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)

# Define your query.
your_query = """SELECT * FROM your_big_query_table"""

# Pass the bqstorage_client as an argument to the to_dataframe() method.
# A tqdm progress bar is added as well, so you get better insight
# into how long the download is still going to take.
dataframe = (
    bqclient.query(your_query)
            .result()
            .to_dataframe(
                bqstorage_client=bqstorageclient,
                progress_bar_type='tqdm_notebook',
            )
)

You can find more on how to use the BigQuery Storage API here:
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
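As a side note, and this is an assumption about recent library versions rather than something the answer above states: newer releases of google-cloud-bigquery can create the storage client for you, so a shorter variant of the same approach looks roughly like this (the google-cloud-bigquery-storage and pyarrow packages still need to be installed for it to take effect):

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials.json"
from google.cloud import bigquery

client = bigquery.Client()

# create_bqstorage_client=True asks the library to spin up a BigQuery Storage
# client internally and use it for the download.
df_bq = (
    client.query("SELECT * FROM dataset.table")
          .result()
          .to_dataframe(create_bqstorage_client=True)
)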

Try using the BigQuery Storage API - it's blazing fast for downloading large tables as pandas dataframes.

https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
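If what you actually need is the whole table rather than a filtered query, a minimal sketch like the one below (the table name is a placeholder) skips the query job entirely and streams the table through the Storage API:

from google.cloud import bigquery
from google.cloud import bigquery_storage

bqclient = bigquery.Client()
bqstorageclient = bigquery_storage.BigQueryReadClient()

# list_rows() reads the table directly instead of running a query job;
# to_dataframe() then pulls the rows through the BigQuery Storage API.
table = bqclient.get_table("your-project.dataset.table")
df = bqclient.list_rows(table).to_dataframe(bqstorage_client=bqstorageclient)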

The WSAETIMEDOUT error means that the connected party did not properly respond after a period of time. You need to review your firewall.

Regarding:

  1. the query takes 2 seconds as you tested
  2. review your firewall
  3. as your data shape is (6000000, 8), this will take time depending on the computing resources you are using

That being said, you might be hitting the connection timeout because converting such a large result set into a dataframe takes too long.

You can separate the query execution from the dataframe conversion and print timestamps to get a better view of what is happening:

    import datetime

    print(datetime.datetime.now())
    query_job = client.query(sql)
    result = query_job.result()     # waits for the query to finish on the server
    print(datetime.datetime.now())
    df_bq = result.to_dataframe()   # downloads the rows and builds the dataframe
    print(datetime.datetime.now())

If the above doesn't help, maybe write the file out to GCS from BQ and then copy to your server from there.
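A minimal sketch of that route, assuming a bucket you control and hypothetical names throughout (extract_table and ExtractJobConfig are the standard client calls; reading gs:// paths back with pandas additionally needs the gcsfs package):

from google.cloud import bigquery
import gcsfs
import pandas as pd

client = bigquery.Client()

# Export the table to GCS; the wildcard lets BigQuery shard large exports
# into multiple files.
destination_uri = "gs://your-bucket/export/table-*.csv.gz"
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)
client.extract_table("your-project.dataset.table", destination_uri, job_config=job_config).result()

# Read the shards back and concatenate them into one dataframe.
fs = gcsfs.GCSFileSystem()
shards = ["gs://" + path for path in fs.glob("your-bucket/export/table-*.csv.gz")]
df_bq = pd.concat(pd.read_csv(s, compression="gzip") for s in shards)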

Alternatively you could run your notebook on a GCE VM and make the most of Google's bandwidth.
