How to read all CSV files from a Google Cloud Storage location into a single pandas dataframe?
I have a GCS bucket and can list all the files in it from Google Colab like this:
!gsutil ls gs://custom_jobs/python_test/
This lists all the files, which are:
test_1.csv
test_2.csv
I can read a single file at a time like this:
d = pd.read_csv('gs://custom_jobs/python_test/test_1.csv')
What I intend to do is read test_1.csv and test_2.csv into a single dataframe, the way I can locally:
import glob
import pandas as pd

files = glob.glob("/home/shashi/python_test/*.csv")

# Collect each file's dataframe, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0; pd.concat is the idiomatic replacement)
frames = []
for file in files:
    frames.append(pd.read_csv(file))
all_dat = pd.concat(frames, ignore_index=True)
How is this possible in Google Colab when my files are in a Google Storage bucket?
Try using the ls command in gsutil.

Ex:
import subprocess
import pandas as pd

# text=True decodes stdout, so the listing comes back as str rather than bytes
result = subprocess.run(['gsutil', 'ls', 'gs://custom_jobs/python_test/*.csv'],
                        stdout=subprocess.PIPE, text=True)

# Read each listed CSV, then concatenate once (DataFrame.append was removed in pandas 2.0)
frames = [pd.read_csv(path) for path in result.stdout.splitlines() if path.strip()]
all_dat = pd.concat(frames, ignore_index=True)
One simple solution might be:
from google.cloud import storage
import pandas as pd

bucket_name = "your-bucket-name"
storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
# (Pass prefix="some/folder/" to restrict the listing to one folder.)
blobs = storage_client.list_blobs(bucket_name)

# Skip non-CSV blobs, then concatenate everything in one pass
frames = [pd.read_csv("gs://{}/{}".format(bucket_name, blob.name))
          for blob in blobs if blob.name.endswith(".csv")]
all_dat = pd.concat(frames, ignore_index=True)
One simple solution that I found was:
files = !gsutil ls -r gs://custom_jobs/python_test/
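In Colab, the ! magic captures the command's stdout as a list of strings, so the files variable above can be filtered and fed to pandas directly. A minimal sketch building on that line (the .csv suffix check is an assumption, meant to skip the directory-header and blank lines that gsutil ls -r also prints):

import pandas as pd

# `files` holds the lines captured from the gsutil listing above;
# keep only the CSV object paths and concatenate them into one dataframe
csv_paths = [f.strip() for f in files if f.strip().endswith('.csv')]
all_dat = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)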