Downloading large data from a BigQuery public dataset with pandas
I'm trying to download data from a BigQuery public dataset and store it locally in a CSV file. When I add LIMIT 10 at the end of the query, my code works, but without it I get an error that says:
Response too large to return. Consider setting allowLargeResults to true in your job configuration.
Thank you in advance! Here is my code:
import pandas as pd
import pandas_gbq as gbq
import tqdm

def get_data(query, project_id):
    data = gbq.read_gbq(query, project_id=project_id, configuration={"allow_large_results": True})
    data.to_csv('blockchain.csv', header=True, index=False)

if __name__ == "__main__":
    query = """SELECT * FROM `bigquery-public-data.crypto_bitcoin.transactions` WHERE block_timestamp>='2017-09-1' and block_timestamp<'2017-10-1';"""
    project_id = "bitcoin-274091"
    get_data(query, project_id)
As was mentioned by @Graham Polley, you may first consider saving the results of your source query to a BigQuery table and then extracting the data from that table to GCS. Due to current pandas_gbq library limitations, to achieve your goal I would recommend using the google-cloud-bigquery package, the officially advised Python library for managing interaction with the BigQuery API.
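For context on why the configuration in the question does not help: pandas_gbq forwards its configuration argument as a BigQuery job resource, so the keys must be nested under "query"; a top-level "allow_large_results" key is not a valid shape. Also, allowLargeResults applies only to legacy SQL and additionally requires a destination table. A sketch of a correctly shaped dictionary, with placeholder project, dataset, and table names:

```python
# Sketch: the shape pandas_gbq expects for its `configuration` argument.
# The keys mirror the BigQuery REST API query resource. "my-project",
# "tmp_dataset", and "tmp_table" are hypothetical placeholders.
configuration = {
    "query": {
        "allowLargeResults": True,   # honoured only for legacy SQL
        "useLegacySql": True,
        "destinationTable": {        # required when allowLargeResults is set
            "projectId": "my-project",
            "datasetId": "tmp_dataset",
            "tableId": "tmp_table",
        },
    }
}
```

Since the original query uses standard SQL (backticked table names), this flag would not apply to it anyway, which is another reason to go through a destination table and an extract job instead.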
In the following example, I've used the bigquery.Client.query() method to trigger a query job with a job_config configuration, and then invoked the bigquery.Client.extract_table() method to fetch the data and store it in a GCS bucket:
from google.cloud import bigquery

client = bigquery.Client()

# Save the query results to a staging table first
# ("project_id.dataset.table" is a placeholder; use your own names).
job_config = bigquery.QueryJobConfig(destination="project_id.dataset.table")
sql = """SELECT * FROM ..."""
query_job = client.query(sql, job_config=job_config)
query_job.result()  # wait for the query job to finish

# Extract the staging table to a CSV file in a GCS bucket
gs_path = "gs://bucket/test.csv"
ds = client.dataset("dataset", project="project_id")
tb = ds.table("table")
extract_job = client.extract_table(tb, gs_path, location='US')
extract_job.result()  # wait for the extract job to finish
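One caveat worth noting: BigQuery refuses to export more than 1 GB of table data to a single file, so for a result set of this size the destination URI usually needs a wildcard, which lets the extract job shard the output across multiple CSV files. A small helper sketching this (the bucket name is a placeholder):

```python
def build_extract_uri(bucket: str, prefix: str, sharded: bool = True) -> str:
    """Build a GCS destination URI for a BigQuery extract job.

    Exports larger than 1 GB must use a wildcard URI so BigQuery can
    split the output into multiple files.
    """
    name = f"{prefix}-*.csv" if sharded else f"{prefix}.csv"
    return f"gs://{bucket}/{name}"

# build_extract_uri("bucket", "test") -> "gs://bucket/test-*.csv"
```

The resulting wildcard path can be passed to client.extract_table() in place of the single-file gs_path above.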
At the end, you can delete the table containing the staging data.
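A minimal sketch of that cleanup step, wrapping the client call so it can be reused (the table id is the same placeholder as above):

```python
def drop_staging_table(client, table_id: str) -> None:
    """Delete the staging table once the extract job has finished.

    not_found_ok=True makes the call a no-op if the table is already
    gone, instead of raising a NotFound error.
    """
    client.delete_table(table_id, not_found_ok=True)

# Usage with a real client:
# drop_staging_table(client, "project_id.dataset.table")
```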