Write a Pandas DataFrame to Google Cloud Storage or BigQuery
Hello, and thanks for your time and consideration. I am developing a Jupyter Notebook in Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to both Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:
import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id
bucket_name = 'steve-temp'
bucket_path = bucket_name
bucket = storage.Bucket(bucket_path)
bucket.exists()
I have tried various approaches based on the Google Datalab documentation but continue to fail. Thanks.
Try the following working example:
from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd
# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'
# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)
# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()
# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object
# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)
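As a quick sanity check you can read the table back into a DataFrame; a minimal sketch, assuming the table object created above (to_dataframe() is part of the datalab BigQuery Table API):
# Read the rows back to confirm the insert worked
result_df = table.to_dataframe()
print(result_df)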
I used this example, and the _table.py file from the datalab github site, as a reference. You can find other datalab source code files at this link.
from google.cloud import storage
import os
import pandas as pd
# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
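To confirm the upload you can read the object straight back; a minimal sketch, assuming the same bucket and blob names as above:
import io
# Download the CSV bytes and parse them back into a DataFrame
data = bucket.blob('upload_test/test.csv').download_as_string()
df_check = pd.read_csv(io.BytesIO(data), index_col=0)
print(df_check)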
Using the Google Cloud Datalab documentation:
import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')
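A matching read-back sketch, assuming the same bucket and item (read_from() is the counterpart of write_to() on a datalab storage item):
# Read the CSV bytes back from the bucket to verify the write
csv_bytes = gcs.Bucket('bucket-name').item('to/data.csv').read_from()
print(csv_bytes[:100])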
Update on @Anthonios Partheniou's answer. The code is a bit different now, as of Nov. 29, 2017.
Pass a tuple containing project_id and dataset_id to bq.Dataset.
# define a BigQuery dataset
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)
Pass a tuple containing project_id, dataset_id and the table name to bq.Table.
# define a BigQuery table
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)
Since 2017, Pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq. The documentation has an example:
import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')
The if_exists parameter can be set to 'fail', 'replace' or 'append'.
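A short round-trip sketch (dataset and table names are placeholders; read_gbq is the companion function in pandas_gbq):
import pandas_gbq as gbq
# Write, then read the table back to verify (placeholder names)
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='replace')
df_check = gbq.read_gbq('SELECT * FROM my_dataset.my_table', project_id=projectid)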
I spent a lot of time finding the easiest way to solve this:
import pandas as pd
df = pd.DataFrame(...)
df.to_csv('gs://bucket/path')
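Note that this relies on pandas' fsspec integration, so the gcsfs package must be installed for gs:// URLs to work; reading back is symmetric (bucket and path are placeholders):
# Requires: pip install gcsfs
import pandas as pd
df.to_csv('gs://bucket/path')
df_check = pd.read_csv('gs://bucket/path', index_col=0)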
I have a slightly simpler solution for the task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage:
import dask.dataframe as dd
import pandas as pd

df  # your Pandas DataFrame
ddf = dd.from_pandas(df, npartitions=1, sort=True)
# `gcs.session.credentials` assumes an already-authenticated gcsfs session named `gcs`
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})
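If you don't have a gcsfs session object handy, gcsfs also accepts named token strategies; a hedged alternative sketch ('google_default' and 'cloud' are gcsfs token options, not Dask ones):
# 'google_default' uses Application Default Credentials;
# 'cloud' uses the GCE/Datalab metadata server when running inside Google Cloud
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False,
           storage_options={'token': 'google_default'})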
I think you need to load it into a plain bytes variable and use %%storage write --variable $sample_bucketpath (see the doc) in a separate cell... I am still figuring it out... but it is roughly the opposite of what I had to do to read a CSV file; I don't know whether it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command... Hope it helps, let me know!
To Google storage:
def write_df_to_gs(df, gs_key):
df.to_csv(gs_key)
To BigQuery:
def upload_df_to_bq(df, project, bq_table):
df.to_gbq(bq_table, project_id=project, if_exists='replace')
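A usage sketch for the two helpers above (bucket, project and table names are placeholders; the GCS helper relies on gcsfs being installed, as noted earlier):
# Placeholder names; replace with your own bucket, project and table
write_df_to_gs(df, 'gs://my-bucket/data.csv')
upload_df_to_bq(df, 'my-project', 'my_dataset.my_table')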
To save a Parquet file to GCS, authenticating with a service account:
df.to_parquet("gs://<bucket-name>/file.parquet",
storage_options={"token": <path-to-gcs-service-account-file>}