In a Jupyter Notebook, I am trying to import data from BigQuery by running a SQL query on the BigQuery server. I then store the result in a dataframe:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="credentials.json"
from google.cloud import bigquery
sql = """
SELECT * FROM dataset.table
"""
client = bigquery.Client()
df_bq = client.query(sql).to_dataframe()
The data has the shape (6000000, 8) and uses about 350MB of memory once stored in the dataframe.
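As a quick sanity check on that figure (assuming 8-byte numeric dtypes for all columns, which is my guess, not something verified):

```python
rows, cols = 6_000_000, 8
bytes_per_cell = 8  # assuming float64/int64 columns
total_mib = rows * cols * bytes_per_cell / 1024**2
print(f"{total_mib:.0f} MiB")  # ≈ 366 MiB, in the same ballpark as the observed ~350 MB
```

So the dataframe size itself is unremarkable; the slowness is in the download, not in memory.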
The query sql, if executed directly in BQ, takes about 2 seconds.
However, it usually takes about 30-40 minutes to execute the code above, and more often than not the code fails to execute raising the following error:
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
All in all, there could be three reasons for the error:
Would be happy to gain any insight into the problem, thanks in advance!
Use BigQuery Storage to get large query results from BigQuery into a pandas dataframe really fast.
Working code snippet:
import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage
# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
# Make clients.
bqclient = bigquery.Client(credentials=credentials, project=your_project_id)
bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)
# define your query
your_query = """select * from your_big_query_table"""
# pass your bqstorage_client as an argument to the to_dataframe() method.
# i've also added the tqdm progress bar here so you get better insight
# into how long it's still going to take
dataframe = (
    bqclient.query(your_query)
    .result()
    .to_dataframe(
        bqstorage_client=bqstorageclient,
        progress_bar_type="tqdm_notebook",
    )
)
You can find more on how to use BigQuery Storage here:
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
Try using the BigQuery Storage API - it's blazing fast for downloading large tables as pandas dataframes.
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
The WSAETIMEDOUT error means that the connected party did not respond within the expected time. You should review your firewall settings.
That said, you might be reaching the connection timeout because converting the result set to a dataframe takes too long.
You can separate the query from the dataframe conversion and print timestamps to get a better view of where the time is spent:
import datetime

result = client.query(sql)
print(datetime.datetime.now())
df_bq = result.to_dataframe()
print(datetime.datetime.now())
If the above doesn't help, maybe write the file out to GCS from BQ and then copy to your server from there.
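A rough sketch of that GCS route (the helper names, bucket, and CSV prefix are placeholders of mine, not part of the original answer; `extract_table` is the standard BigQuery extract-job call):

```python
def gcs_wildcard_uri(bucket: str, prefix: str) -> str:
    # The wildcard lets BigQuery shard a large export across multiple files,
    # which is required for exports over 1 GB.
    return f"gs://{bucket}/{prefix}-*.csv"


def export_table_to_gcs(project: str, table_id: str, bucket: str, prefix: str) -> str:
    """Run a BigQuery extract job that writes a table to GCS as CSV shards."""
    # Imported inside the function so the pure helper above works without
    # the google-cloud-bigquery package installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    destination_uri = gcs_wildcard_uri(bucket, prefix)
    extract_job = client.extract_table(table_id, destination_uri)
    extract_job.result()  # block until the export finishes
    return destination_uri
```

You can then pull the shards down with `gsutil cp 'gs://your-bucket/your-prefix-*.csv' .` and read them locally with pandas.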
Alternatively you could run your notebook on a GCE VM and make the most of Google's bandwidth.