
How to read all CSV files from a Google Cloud Storage location into a single pandas dataframe?

I have a GCS bucket and can list all the files in it from Google Colab like this:

!gsutil ls gs://custom_jobs/python_test/

This lists all the files, which are:

test_1.csv
test_2.csv

I can read a single file at a time like this:

d = pd.read_csv('gs://custom_jobs/python_test/test_1.csv')

What I intend to do is read test_1.csv and test_2.csv into a single dataframe, like we can do locally:

import glob
import pandas as pd

files = glob.glob("/home/shashi/python_test/*.csv")

# DataFrame.append was removed in pandas 2.0; read each file and concat once
all_dat = pd.concat([pd.read_csv(file) for file in files], ignore_index=True)

How is this possible in Google Colab when my files are in a Google Storage bucket?

Try using the ls command in gsutil.

Ex:

import subprocess

import pandas as pd

result = subprocess.run(['gsutil', 'ls', 'gs://custom_jobs/python_test/*.csv'], stdout=subprocess.PIPE)

# result.stdout is bytes: decode each line before passing it to read_csv
frames = []
for file in result.stdout.decode('utf-8').splitlines():
    frames.append(pd.read_csv(file.strip()))
all_dat = pd.concat(frames, ignore_index=True)
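The decode-and-filter step can be checked in isolation. A minimal sketch (the `csv_uris` helper and the sample bytes are hypothetical, standing in for real `gsutil` output):

```python
def csv_uris(gsutil_ls_output: bytes) -> list:
    # gsutil prints one URI per line; decode and keep only the .csv entries
    return [
        line.strip()
        for line in gsutil_ls_output.decode("utf-8").splitlines()
        if line.strip().endswith(".csv")
    ]

# Simulated stdout from `gsutil ls gs://custom_jobs/python_test/*.csv`
sample = b"gs://custom_jobs/python_test/test_1.csv\ngs://custom_jobs/python_test/test_2.csv\n"
print(csv_uris(sample))
```

Each URI returned this way can be handed straight to `pd.read_csv`, since pandas understands `gs://` paths when `gcsfs` is installed.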

One simple solution might be:

from google.cloud import storage

import pandas as pd

bucket_name = "your-bucket-name"

storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)

# Skip any non-CSV objects the bucket may contain, then concat once
frames = [
    pd.read_csv("gs://{}/{}".format(bucket_name, blob.name))
    for blob in blobs
    if blob.name.endswith(".csv")
]
all_dat = pd.concat(frames, ignore_index=True)
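Growing a dataframe with an append per iteration copies every existing row each time; collecting the frames and concatenating once scales much better. A minimal local sketch of the pattern, with in-memory CSVs standing in for the bucket objects:

```python
import io

import pandas as pd

# Two small in-memory CSVs standing in for test_1.csv and test_2.csv
csv_1 = io.StringIO("a,b\n1,2\n3,4\n")
csv_2 = io.StringIO("a,b\n5,6\n")

# One concat at the end instead of an append per file
all_dat = pd.concat([pd.read_csv(f) for f in (csv_1, csv_2)], ignore_index=True)
print(all_dat.shape)
```

The same list-then-concat shape applies unchanged when the iterable yields `gs://` paths instead of `StringIO` buffers.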



One simple solution that I found was:

files = !gsutil ls -r gs://custom_jobs/python_test/
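The list returned by the cell magic still has to be filtered and read: `gsutil ls -r` typically interleaves directory headers (ending in `:`) and blank lines with the object URIs. A sketch of the follow-up step, with the `files` list simulated rather than fetched:

```python
# Simulated output of `!gsutil ls -r gs://custom_jobs/python_test/`
files = [
    "gs://custom_jobs/python_test/:",
    "gs://custom_jobs/python_test/test_1.csv",
    "gs://custom_jobs/python_test/test_2.csv",
    "",
]

# Keep only the CSV object URIs
csv_files = [f for f in files if f.endswith(".csv")]
print(csv_files)

# With credentials configured, the frames could then be combined as:
# all_dat = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
```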


