Read csv from Google Cloud Storage to pandas dataframe

Question

I am trying to read a csv file stored in a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I cannot find any solution that does not involve Google Datalab.

Answer 1

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the object in the bucket like this:

df = pd.read_csv('gs://bucket/your_path.csv')

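read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).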
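A minimal sketch of checking for that dependency up front, so the failure is a readable message rather than a traceback from inside pandas. The helper name missing_modules is made up for this illustration; it relies only on the standard library.

```python
import importlib.util


def missing_modules(required=("pandas", "gcsfs")):
    """Return the names in `required` that cannot be imported in this environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    # Hypothetical message; install with pip or conda as appropriate.
    print("Install before reading gs:// URLs:", ", ".join(missing))
```

If the list is empty, pd.read_csv('gs://bucket/your_path.csv') should at least get past the import step; authentication is a separate concern.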

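I leave three other options for the sake of completeness.

Home-made code
gcsfs
Dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.

It works equally on public and private data sets, assuming you are authorised. With this approach you do not need to download the data to your local drive first.

How to use it: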

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
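The functions above take the bucket and the path within the bucket as separate arguments, while the question starts from a single gs:// URL. A small sketch of splitting such a URL follows; split_gcs_url is a hypothetical helper written for this illustration, not part of google-cloud-storage.

```python
def split_gcs_url(url):
    """Split 'gs://bucket/path/to/file.csv' into ('bucket', 'path/to/file.csv')."""
    for prefix in ("gs://", "gcs://"):
        if url.startswith(prefix):
            # Everything up to the first slash is the bucket, the rest is the object path.
            bucket, _, blob_path = url[len(prefix):].partition("/")
            if not bucket or not blob_path:
                raise ValueError("expected gs://<bucket>/<path>, got %r" % url)
            return bucket, blob_path
    raise ValueError("not a Google Cloud Storage URL: %r" % url)


bucket, path = split_gcs_url("gs://createbucket123/my.csv")
```

The two values can then be passed on, for example as get_byte_fileobj('my-project', bucket, path).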

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)

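Dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It is great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, which makes it easy for newcomers to use.

Dask provides its own read_csv.

How to use it: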

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()

Answer 2

Another option is to use TensorFlow, which comes with the ability to do a streaming read from Google Cloud Storage:

import pandas as pd
from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
  df = pd.read_csv(f)

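Using TensorFlow also gives you a convenient way to handle wildcards in the filename. For example:

Reading wildcard CSV into Pandas

The code below reads all CSVs that match a specific pattern (e.g. gs://bucket/some/dir/train-*) into a pandas dataframe: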

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
  with file_io.FileIO(filename, 'r') as f:
    df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
  # Written as tf.gfile.Glob in TensorFlow 1.x; TensorFlow 2.x exposes it as tf.io.gfile.glob
  filenames = tf.io.gfile.glob(filename_pattern)
  dataframes = [read_csv_file(filename) for filename in filenames]
  return pd.concat(dataframes)

Usage

import os

DATADIR='gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
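For readers who do not have TensorFlow installed, the same kind of pattern can be applied to a plain list of object names. The sketch below uses only the standard library, and the names are made up for illustration. In practice the list would have to come from the storage client, for example from storage.Client().list_blobs(bucket_name, prefix='some/dir/'), which is an assumption about your setup and is not exercised here.

```python
from fnmatch import fnmatchcase


def filter_blob_names(names, pattern):
    """Keep the object names that match a shell-style pattern, in sorted order."""
    # Note: unlike a filesystem glob, '*' in fnmatch also matches '/' characters.
    return sorted(name for name in names if fnmatchcase(name, pattern))


# Hypothetical object names inside a bucket.
names = [
    "some/dir/train-00000.csv",
    "some/dir/train-00001.csv",
    "some/dir/eval-00000.csv",
    "some/dir/README.txt",
]
train_names = filter_blob_names(names, "some/dir/train-*")
```

Each selected name can then be downloaded and parsed individually, and the resulting dataframes combined with pd.concat as in read_csv_files above.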

Answer 3

As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.

Until the official release you can try it out with pip install pandas==0.24.0rc1.

Answer 4

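I was looking at this question and did not want to go through the hassle of installing another library, gcsfs, whose documentation literally says "This software is beta, use at your own risk". I found a good workaround that I wanted to post here in case it helps anyone else. It uses just the google.cloud storage library and some native Python libraries. Here is the function: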

import pandas as pd
from google.cloud import storage
import os
import io
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/creds.json'


def gcp_csv_to_df(bucket_name, source_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_file_name)
    data = blob.download_as_string()
    df = pd.read_csv(io.BytesIO(data))
    print(f'Pulled down file from bucket {bucket_name}, file name: {source_file_name}')
    return df

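Also, although it is outside the scope of this question, if you would like to upload a pandas dataframe to GCP with a similar function, here is the code to do so: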

def df_to_gcp_csv(df, dest_bucket_name, dest_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(dest_bucket_name)
    blob = bucket.blob(dest_file_name)
    blob.upload_from_string(df.to_csv(), 'text/csv')
    print(f'DataFrame uploaded to bucket {dest_bucket_name}, file name: {dest_file_name}')

Hope this helps! I know I will be using these functions for sure.

Answer 5

Since pandas 1.2 it is very easy to load files from Google Storage into a DataFrame.

If you work on your local machine it looks like this:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "credentials.json"})

It is important that you pass the credentials.json file from Google as the token.

If you work on Google Cloud, do this:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "cloud"})
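The two cases above differ only in the token value, so they can be folded into one small helper. This is a sketch under that assumption; gcs_storage_options is a hypothetical name, and only the two token forms shown in this answer are covered.

```python
def gcs_storage_options(credentials_path=None):
    """Build the storage_options dict to pass to pd.read_csv for a gcs:// URL."""
    if credentials_path:
        # Local machine: point at the service account JSON file.
        return {"token": credentials_path}
    # Running on Google Cloud: use the credentials of the environment.
    return {"token": "cloud"}


local_options = gcs_storage_options("credentials.json")
cloud_options = gcs_storage_options()
```

The result would be used as pd.read_csv('gcs://your-bucket/path/data.csv.gz', storage_options=local_options).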

Answer 6

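read_csv does not support gs://

From the documentation:

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

You can download the file, or fetch it as a string, in order to manipulate it.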
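A short sketch of the second route, once the content is already in memory. The bytes here are simulated locally. With the google-cloud-storage client they would come from a blob download call, which is not shown and is an assumption about your setup.

```python
from io import BytesIO, StringIO

import pandas as pd

# Stand-in for the bytes returned by a blob download.
raw_bytes = b"col1,col2\n1,2\n3,4\n"

# Option 1: wrap the raw bytes in a binary file object.
df_from_bytes = pd.read_csv(BytesIO(raw_bytes))

# Option 2: decode to text first, then wrap the text in a text file object.
df_from_text = pd.read_csv(StringIO(raw_bytes.decode("utf-8")))
```

Both calls parse the same content; the only difference is whether pandas or your own code does the decoding.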

Answer 7

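There are three ways of accessing files in GCS:

Downloading the client library (this one is for you)
Using the Cloud Storage Browser in the Google Cloud Platform Console
Using gsutil, a command-line tool for working with files in Cloud Storage.

Using step 1, set up GCS for your work. After that you have to: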

import cloudstorage as gcs
from google.appengine.api import app_identity

Then you have to specify the Cloud Storage bucket name and create read/write functions to access your bucket:

You can find the remaining read/write tutorial here:

Answer 8

Using the pandas and google-cloud-storage Python packages:

First, we upload a file to the bucket in order to have a fully working example:

import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()

data_df = pd.DataFrame(
    dataset.data,
    columns=dataset.feature_names)

data_df.head()

Out[1]: 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Upload a csv file to the bucket (GCP credentials need to be set up, read more here):

from io import StringIO
from google.cloud import storage

bucket_name = 'my-bucket-name' # Replace it with your own bucket name.
data_path = 'somepath/data.csv'

# Get Google Cloud client
client = storage.Client()

# Get bucket object
bucket = client.get_bucket(bucket_name)

# Get blob object (this is pointing to the data_path)
data_blob = bucket.blob(data_path)

# Upload a csv to google cloud storage
data_blob.upload_from_string(
    data_df.to_csv(), 'text/csv')

Now that we have a csv in the bucket, use pd.read_csv by passing it the content of the file.

# Read from bucket
data_str = data_blob.download_as_text()

# Instantiate dataframe
data_downloaded_df = pd.read_csv(StringIO(data_str))

data_downloaded_df.head()

Out[2]: 
   Unnamed: 0  sepal length (cm)  ...  petal length (cm)  petal width (cm)
0           0                5.1  ...                1.4               0.2
1           1                4.9  ...                1.4               0.2
2           2                4.7  ...                1.3               0.2
3           3                4.6  ...                1.5               0.2
4           4                5.0  ...                1.4               0.2

[5 rows x 5 columns]

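Comparing this approach with pd.read_csv('gs://my-bucket/file.csv'), I found that the approach described here makes it more explicit that client = storage.Client() is what takes care of authentication (which can be very handy when working with multiple credentials). Also, storage.Client comes already installed if you run this code on a Google Cloud Platform resource, whereas for pd.read_csv('gs://my-bucket/file.csv') you need to have installed the gcsfs package that allows pandas to access Google Storage.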
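The "Unnamed: 0" column in Out[2] appears because data_df.to_csv() writes the index as an unnamed first column. A small local sketch of the effect and of two ways to avoid it follows; the toy dataframe is made up for illustration and no bucket is involved.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

# Default to_csv() includes the index, which comes back as an extra column.
with_index = pd.read_csv(StringIO(df.to_csv()))

# Either skip the index when writing...
without_index = pd.read_csv(StringIO(df.to_csv(index=False)))

# ...or tell read_csv that the first column is the index.
restored = pd.read_csv(StringIO(df.to_csv()), index_col=0)
```

In the upload step above, that would mean passing data_df.to_csv(index=False) to upload_from_string.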

Answer 9

If I understood your question correctly, then maybe this link can help you get a better URL for your read_csv() function:

https://cloud.google.com/storage/docs/access-public-data

Answer 10

If you load compressed files, gcsfs is still required.

I tried pd.read_csv('gs://your-bucket/path/data.csv.gz') with pandas version 0.25.3 and got the following error:

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    438     # See https://github.com/python/mypy/issues/1297
    439     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 440         filepath_or_buffer, encoding, compression
    441     )
    442     kwds["compression"] = compression

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    211 
    212     if is_gcs_url(filepath_or_buffer):
--> 213         from pandas.io import gcs
    214 
    215         return gcs.get_filepath_or_buffer(

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
      3 
      4 gcsfs = import_optional_dependency(
----> 5     "gcsfs", extra="The gcsfs library is required to handle GCS files"
      6 )
      7 

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
     91     except ImportError:
     92         if raise_on_missing:
---> 93             raise ImportError(message.format(name=name, extra=extra)) from None
     94         else:
     95             return None

ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.
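If you would rather not install gcsfs and instead download the object's bytes with the storage client, as other answers do, note that pandas cannot infer the compression from a file name when it is handed an in-memory buffer. A minimal sketch, with the compressed bytes simulated locally instead of downloaded:

```python
import gzip
from io import BytesIO

import pandas as pd

# Stand-in for the bytes of a downloaded data.csv.gz object.
raw = gzip.compress(b"col1,col2\n1,2\n3,4\n")

# State the compression explicitly, since there is no file name to infer it from.
df = pd.read_csv(BytesIO(raw), compression="gzip")
```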

Answer 11

Google Cloud Storage has a method download_as_bytes(), and you can then read the csv from those bytes (HT to NEWBEDEV). The code would look like this:

import pandas as pd
from io import BytesIO
from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.get_bucket(event['bucket']).get_blob(event['name'])
blobBytes = blob.download_as_bytes()
df = pd.read_csv(BytesIO(blobBytes))

My event comes from a cloud storage example.

11 answers

Answer 1: 152, 2018-05-06 15:05:51
Answer 2: 23, 2018-08-30 22:50:40
Answer 3: 6, 2019-01-17 18:18:23
Answer 4: 6, 2021-05-03 03:51:15
Answer 5: 5, 2021-04-01 11:45:18
Answer 6: 3, 2018-03-19 07:03:17
Answer 7: 2, 2018-03-19 07:16:54
Answer 8: 2, 2021-09-21 22:57:55
Answer 9: 1, 2018-03-19 09:38:32
Answer 10: 0, 2020-04-24 06:59:56
Answer 11: 0, 2022-06-15 11:01:43
