Write a Pandas DataFrame to Google Cloud Storage or BigQuery
Hello, and thanks for your time and consideration. I am developing a Jupyter Notebook in Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to both Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:
import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id
bucket_name = 'steve-temp'
bucket_path = bucket_name
bucket = storage.Bucket(bucket_path)
bucket.exists()
I have tried various approaches based on the Google Datalab documentation but continue to fail. Thanks.
Try the following working example:
from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd
# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'
# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)
# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()
# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object
# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)
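As a quick sanity check you can read the table back into a DataFrame; a minimal sketch, assuming the table object created above (to_dataframe() is part of the datalab BigQuery Table API):
# Read the rows back to confirm the insert worked
result_df = table.to_dataframe()
print(result_df)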
I used this example, and the _table.py file from the datalab github site, as a reference. You can find other datalab source code files at this link.
from google.cloud import storage
import os
import pandas as pd
# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
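To confirm the upload you can read the object straight back; a minimal sketch, assuming the same bucket and blob names as above:
import io
# Download the CSV bytes and parse them back into a DataFrame
data = bucket.blob('upload_test/test.csv').download_as_string()
df_check = pd.read_csv(io.BytesIO(data), index_col=0)
print(df_check)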
Using the Google Cloud Datalab documentation:
import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')
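A matching read-back sketch, assuming the same bucket and item (read_from() is the counterpart of write_to() on a datalab storage item):
# Read the CSV bytes back from the bucket to verify the write
csv_bytes = gcs.Bucket('bucket-name').item('to/data.csv').read_from()
print(csv_bytes[:100])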
Update on @Anthonios Partheniou's answer. The code is a bit different now, as of Nov. 29, 2017.
Pass a tuple containing project_id and dataset_id to bq.Dataset.
# define a BigQuery dataset
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)
Pass a tuple containing project_id, dataset_id and the table name to bq.Table.
# define a BigQuery table
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)
Since 2017, Pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq. The documentation has an example:
import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')
The if_exists parameter can be set to 'fail', 'replace' or 'append'.
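A short round-trip sketch (dataset and table names are placeholders; read_gbq is the companion function in pandas_gbq):
import pandas_gbq as gbq
# Write, then read the table back to verify (placeholder names)
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='replace')
df_check = gbq.read_gbq('SELECT * FROM my_dataset.my_table', project_id=projectid)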
I spent a lot of time finding the easiest way to solve this:
import pandas as pd
df = pd.DataFrame(...)
df.to_csv('gs://bucket/path')
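Note that this relies on pandas' fsspec integration, so the gcsfs package must be installed for gs:// URLs to work; reading back is symmetric (bucket and path are placeholders):
# Requires: pip install gcsfs
import pandas as pd
df.to_csv('gs://bucket/path')
df_check = pd.read_csv('gs://bucket/path', index_col=0)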
I have a slightly simpler solution for the task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage:
import dask.dataframe as dd
import pandas as pd

df  # your Pandas DataFrame
ddf = dd.from_pandas(df, npartitions=1, sort=True)
# `gcs.session.credentials` assumes an already-authenticated gcsfs session named `gcs`
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})
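If you don't have a gcsfs session object handy, gcsfs also accepts named token strategies; a hedged alternative sketch ('google_default' and 'cloud' are gcsfs token options, not Dask ones):
# 'google_default' uses Application Default Credentials;
# 'cloud' uses the GCE/Datalab metadata server when running inside Google Cloud
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False,
           storage_options={'token': 'google_default'})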
I think you need to load it into a plain bytes variable and use %%storage write --variable $sample_bucketpath (see the doc) in a separate cell... I am still figuring it out... but it is roughly the opposite of what I had to do to read a CSV file; I don't know whether it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command... Hope it helps, let me know!
To Google storage:
def write_df_to_gs(df, gs_key):
df.to_csv(gs_key)
To BigQuery:
def upload_df_to_bq(df, project, bq_table):
df.to_gbq(bq_table, project_id=project, if_exists='replace')
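A usage sketch for the two helpers above (bucket, project and table names are placeholders; the GCS helper relies on gcsfs being installed, as noted earlier):
# Placeholder names; replace with your own bucket, project and table
write_df_to_gs(df, 'gs://my-bucket/data.csv')
upload_df_to_bq(df, 'my-project', 'my_dataset.my_table')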
To save a Parquet file to GCS, authenticating with a service account:
df.to_parquet("gs://<bucket-name>/file.parquet",
storage_options={"token": <path-to-gcs-service-account-file>}