
Read csv from Google Cloud storage to pandas dataframe

I am trying to read a csv file from a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I am not able to find any solution that does not involve google datalab.

UPDATE

As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the bucket like this:

df = pd.read_csv('gs://bucket/your_path.csv')

read_csv will then use the gcsfs module to read the DataFrame, which means it has to be installed (or you will get an exception pointing at the missing dependency).
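
A quick way to fail with a clear message if the dependency is missing (this check is only an illustration, not something pandas requires you to write):

import importlib.util

# Illustrative guard: pandas will raise its own error too, but this message is explicit
if importlib.util.find_spec("gcsfs") is None:
    raise ImportError("Reading gs:// paths with pandas requires gcsfs: pip install gcsfs")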

I leave three other options for the sake of completeness.

  • Home-made code
  • gcsfs
  • dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Storage. To make it more readable I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.

It works equally on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
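
For completeness, the byte-string variant can be used in a similar way: decode the bytes and wrap them in StringIO before handing them to pandas (the project, bucket and path names below are illustrative):

# Illustrative usage of get_bytestring defined above
csv_bytes = get_bytestring('my-project', 'my-bucket', 'my-path.csv')
df = pd.read_csv(StringIO(csv_bytes.decode('utf-8')))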

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
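
If you are not on a machine that already has default credentials, gcsfs can also be pointed at a service-account key explicitly; the key path below is an assumption you would replace with your own:

# Sketch: authenticate gcsfs with an explicit service-account key (path is illustrative)
fs = gcsfs.GCSFileSystem(project='my-project', token='path/to/service-account.json')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)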

dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". Dask “为分析提供高级并行性,为您喜爱的工具实现大规模性能”。 It's great when you need to deal with large volumes of data in Python.当您需要在 Python 中处理大量数据时,它非常棒。 Dask tries to mimic much of the pandas API, making it easy to use for newcomers. Dask 尝试模仿pandas API 的大部分内容,使其易于新手使用。

Here is the read_csv documentation.

How to use it:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now a Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()
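
Because Dask mirrors the pandas API, typical transformations look the same and stay lazy until you call compute(); the column name below is hypothetical:

# Hypothetical example: nothing is read or computed until .compute() is called
subset = df[df['some_column'] > 0]
mean_value = subset['some_column'].mean().compute()  # returns a plain float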

Another option is to use TensorFlow which comes with the ability to do a streaming read from Google Cloud Storage:

import pandas as pd
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
  df = pd.read_csv(f)

Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:

Reading wildcard CSV into Pandas

Here is code that will read all CSVs that match a specific pattern (e.g. gs://bucket/some/dir/train-*) into a Pandas dataframe:

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
  with file_io.FileIO(filename, 'r') as f:
    df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
  filenames = tf.gfile.Glob(filename_pattern)
  dataframes = [read_csv_file(filename) for filename in filenames]
  return pd.concat(dataframes)

Usage:

import os

DATADIR='gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))

As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.

Until the official release you can try it out with pip install pandas==0.24.0rc1.
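
With gcsfs in place the read itself is a one-liner; the bucket and file names here are placeholders:

import pandas as pd

df = pd.read_csv('gs://my-bucket/my-file.csv')  # placeholder bucket/file names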

I was taking a look at this question and didn't want to have to go through the hassle of installing another library, gcsfs, which literally says in the documentation, This software is beta, use at your own risk... but I found a great workaround that I wanted to post here in case this is helpful to anyone else, using just the google.cloud storage library and some native python libraries. Here's the function:

import pandas as pd
from google.cloud import storage
import os
import io
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/creds.json'


def gcp_csv_to_df(bucket_name, source_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_file_name)
    data = blob.download_as_string()
    df = pd.read_csv(io.BytesIO(data))
    print(f'Pulled down file from bucket {bucket_name}, file name: {source_file_name}')
    return df

Further, although it is outside of the scope of this question, if you would like to upload a pandas dataframe to GCP using a similar function, here is the code to do so:

def df_to_gcp_csv(df, dest_bucket_name, dest_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(dest_bucket_name)
    blob = bucket.blob(dest_file_name)
    blob.upload_from_string(df.to_csv(), 'text/csv')
    print(f'DataFrame uploaded to bucket {dest_bucket_name}, file name: {dest_file_name}')
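
For example, assuming a bucket named my-bucket that already contains data.csv (both names are illustrative), the round trip with the two helpers looks like this:

# Illustrative round trip using the helpers above (bucket/file names are placeholders)
df = gcp_csv_to_df('my-bucket', 'data.csv')
df_to_gcp_csv(df, 'my-bucket', 'data_copy.csv')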

Hope this is helpful! I know I'll be using these functions for sure.

Since Pandas 1.2 it's super easy to load files from Google storage into a DataFrame.

If you work on your local machine it looks like this:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "credentials.json"})

It's important that you add the credentials.json file from Google as the token.

If you work on Google Cloud, do this:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "cloud"})

read_csv does not support gs://

From the documentation:

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

You can download the file or fetch it as a string in order to manipulate it.
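
A minimal sketch of the fetch-it-as-a-string route with the google.cloud client (bucket and object names are placeholders, and you must be authenticated):

from io import StringIO

import pandas as pd
from google.cloud import storage

client = storage.Client()
blob = client.get_bucket('my-bucket').blob('my.csv')  # placeholder names
df = pd.read_csv(StringIO(blob.download_as_string().decode('utf-8')))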

There are three ways of accessing files in GCS:

  1. Downloading the client library (this one is for you)
  2. Using Cloud Storage Browser in the Google Cloud Platform Console
  3. Using gsutil, a command-line tool for working with files in Cloud Storage

Using Step 1, set up GCS for your work. After which you have to:

import cloudstorage as gcs
from google.appengine.api import app_identity

Then you have to specify the Cloud Storage bucket name and create read/write functions to access your bucket:
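
A minimal read helper along those lines might look like the sketch below; the function name is hypothetical, and the cloudstorage library expects object names in the /bucket/object form:

def read_from_bucket(bucket_name, object_path):
    # Hypothetical helper: build the /bucket/object filename that cloudstorage expects
    gcs_filename = '/{}/{}'.format(bucket_name, object_path)
    with gcs.open(gcs_filename, 'r') as gcs_file:
        return gcs_file.read()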

You can find the remaining read/write tutorial here:

Using pandas and google-cloud-storage python packages:

First, we upload a file to the bucket in order to get a fully working example:

import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()

data_df = pd.DataFrame(
    dataset.data,
    columns=dataset.feature_names)

data_df.head()
Out[1]: 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Upload a csv file to the bucket (GCP credentials setup is required, read more here):

from io import StringIO
from google.cloud import storage

bucket_name = 'my-bucket-name' # Replace it with your own bucket name.
data_path = 'somepath/data.csv'

# Get Google Cloud client
client = storage.Client()

# Get bucket object
bucket = client.get_bucket(bucket_name)

# Get blob object (this is pointing to the data_path)
data_blob = bucket.blob(data_path)

# Upload a csv to google cloud storage
data_blob.upload_from_string(
    data_df.to_csv(), 'text/csv')

Now that we have a csv in the bucket, use pd.read_csv by passing the content of the file.

# Read from bucket
data_str = data_blob.download_as_text()

# Instantiate dataframe
data_downloaded_df = pd.read_csv(StringIO(data_str))

data_downloaded_df.head()
Out[2]: 
   Unnamed: 0  sepal length (cm)  ...  petal length (cm)  petal width (cm)
0           0                5.1  ...                1.4               0.2
1           1                4.9  ...                1.4               0.2
2           2                4.7  ...                1.3               0.2
3           3                4.6  ...                1.5               0.2
4           4                5.0  ...                1.4               0.2

[5 rows x 5 columns]

When comparing this approach with the pd.read_csv('gs://my-bucket/file.csv') approach, I found that the approach described here makes it more explicit that client = storage.Client() is the one taking care of the authentication (which could be very handy when working with multiple credentials). Also, storage.Client comes already fully installed if you run this code on a resource from Google Cloud Platform, whereas for pd.read_csv('gs://my-bucket/file.csv') you'll need to have installed the package gcsfs, which allows pandas to access Google Storage.

If I understood your question correctly then maybe this link can help you get a better URL for your read_csv() function:

https://cloud.google.com/storage/docs/access-public-data

One will still need to import gcsfs if loading compressed files.

Tried pd.read_csv('gs://your-bucket/path/data.csv.gz') in pandas version => 0.25.3 and got the following error:

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    438     # See https://github.com/python/mypy/issues/1297
    439     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 440         filepath_or_buffer, encoding, compression
    441     )
    442     kwds["compression"] = compression

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    211 
    212     if is_gcs_url(filepath_or_buffer):
--> 213         from pandas.io import gcs
    214 
    215         return gcs.get_filepath_or_buffer(

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
      3 
      4 gcsfs = import_optional_dependency(
----> 5     "gcsfs", extra="The gcsfs library is required to handle GCS files"
      6 )
      7 

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
     91     except ImportError:
     92         if raise_on_missing:
---> 93             raise ImportError(message.format(name=name, extra=extra)) from None
     94         else:
     95             return None

ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.

Google Cloud Storage has a method download_as_bytes(); from that you can read a csv from the bytes (hat tip to NEWBEDEV). The code would look like this:

import pandas as pd
from io import BytesIO
from google.cloud import storage

storage_client = storage.Client()  # client was implicit in the original snippet

blob = storage_client.get_bucket(event['bucket']).get_blob(event['name'])
blobBytes = blob.download_as_bytes()
df = pd.read_csv(BytesIO(blobBytes))

My event comes from a cloud storage example.
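
For reference, a Cloud Storage trigger event is a dict that carries at least the bucket and object name, so a hand-built event for local testing could look like this (values are illustrative):

# Illustrative event payload with the two fields used above
event = {'bucket': 'my-bucket', 'name': 'path/to/my.csv'}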

