
Write a Pandas DataFrame to Google Cloud Storage or BigQuery

Hello and thanks for your time and consideration. I am developing a Jupyter Notebook in the Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to both Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:

import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id    
bucket_name = 'steve-temp'           
bucket_path  = bucket_name   
bucket = storage.Bucket(bucket_path)
bucket.exists()  

I have tried various approaches based on Google Datalab documentation but continue to fail. Thanks

Try the following working example:

from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

I used this example, and the _table.py file from the datalab github site, as a reference. You can find other datalab source code files at this link.

Uploading to Google Cloud Storage without writing a temporary file, using only the standard GCS module

from google.cloud import storage
import os
import pandas as pd

# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
    
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')

Using the Google Cloud Datalab documentation

import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')

Writing a Pandas DataFrame to BigQuery

Update on @Anthonios Partheniou's answer.
The code is a bit different now - as of Nov. 29, 2017.

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

# define a BigQuery dataset    
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id and the table name to bq.Table.

# define a BigQuery table    
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)

Create the dataset/table and write to the table in BQ

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)

Since 2017, Pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq.

The documentation has an example:

import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')

Parameter if_exists can be set to 'fail', 'replace' or 'append'.
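
As a rough sketch (the project and table names below are placeholders), appending rows to an existing table would look like this:

import pandas as pd
import pandas_gbq as gbq

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# 'my-project' and 'my_dataset.my_table' are placeholders; 'append' adds the
# rows to the existing table instead of failing or replacing it.
gbq.to_gbq(df, 'my_dataset.my_table', project_id='my-project', if_exists='append')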

See also this example.

I spent a lot of time to find the easiest way to solve this:

import pandas as pd

df = pd.DataFrame(...)

df.to_csv('gs://bucket/path')
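
This relies on pandas handing gs:// paths to gcsfs (via fsspec) under the hood, so gcsfs must be installed. If you need to pass credentials explicitly, a sketch with placeholder names, assuming a service account key file, would be:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Bucket, object path and key file below are placeholders; requires the gcsfs package.
df.to_csv('gs://my-bucket/data/df.csv', index=False,
          storage_options={'token': 'service-account.json'})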

I have a slightly simpler solution for the task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage.

import dask.dataframe as dd
import pandas as pd

df  # your Pandas DataFrame
ddf = dd.from_pandas(df, npartitions=1, sort=True)
# 'gcs' below is assumed to be an authenticated gcsfs session providing credentials
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})

I think you need to load it into a plain bytes variable and, in a separate cell, use %%storage write --variable $sample_bucketpath (see the doc)... I am still figuring it out... but this is roughly the reverse of what I needed to do to read a CSV file in. I don't know whether it makes a difference for writing, but I had to use BytesIO to read the buffer created by the %%storage read command... Hope it helps, let me know!
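
A rough, untested sketch of that idea (the bucket path and variable names below are made up), using the same storage magics as in the first answer:

# Serialize the DataFrame to text in one cell first
csv_data = simple_dataframe.to_csv(index=False)

# Then, in a separate cell, hand that variable to the storage magic:
# %%storage write --variable csv_data --object gs://my-bucket/data.csv

# Reading back is the reverse: %%storage read fills a variable with the raw
# bytes, which can be wrapped in BytesIO for pandas:
# %%storage read --object gs://my-bucket/data.csv --variable csv_bytes
from io import BytesIO
import pandas as pd
# df = pd.read_csv(BytesIO(csv_bytes))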

To Google Storage:

def write_df_to_gs(df, gs_key):
    df.to_csv(gs_key)    

To BigQuery:

def upload_df_to_bq(df, project, bq_table):
    df.to_gbq(bq_table, project_id=project, if_exists='replace')
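
For example (the bucket path, project and table names are placeholders):

write_df_to_gs(df, 'gs://my-bucket/data/df.csv')
upload_df_to_bq(df, 'my-project', 'my_dataset.my_table')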

To save a parquet file in GCS with service account authentication:

df.to_parquet("gs://<bucket-name>/file.parquet",
              storage_options={"token": "<path-to-gcs-service-account-file>"})
