How to read all CSV files from a Google Cloud Storage location into a single pandas dataframe?
I have a GCS bucket and can list all the files in it from Google Colab like this:
!gsutil ls gs://custom_jobs/python_test/
This lists all the files, which are:
test_1.csv
test_2.csv
I can read a single file at a time like this:
d = pd.read_csv('gs://custom_jobs/python_test/test_1.csv')
What I intend to do is read test_1.csv and test_2.csv into a single dataframe, the way I can locally:
import glob
import pandas as pd

files = glob.glob("/home/shashi/python_test/*.csv")

# Collect each file's dataframe, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0; pd.concat is the idiomatic replacement)
frames = []
for file in files:
    frames.append(pd.read_csv(file))
all_dat = pd.concat(frames, ignore_index=True)
How is this possible in Google Colab when my files are in a Google Storage bucket?
Try using the ls command in gsutil.

Ex:
import subprocess
import pandas as pd

# text=True decodes stdout, so the listing comes back as str rather than bytes
result = subprocess.run(['gsutil', 'ls', 'gs://custom_jobs/python_test/*.csv'],
                        stdout=subprocess.PIPE, text=True)

# Read each listed CSV, then concatenate once (DataFrame.append was removed in pandas 2.0)
frames = [pd.read_csv(path) for path in result.stdout.splitlines() if path.strip()]
all_dat = pd.concat(frames, ignore_index=True)
One simple solution might be:
from google.cloud import storage
import pandas as pd

bucket_name = "your-bucket-name"
storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
# (Pass prefix="some/folder/" to restrict the listing to one folder.)
blobs = storage_client.list_blobs(bucket_name)

# Skip non-CSV blobs, then concatenate everything in one pass
frames = [pd.read_csv("gs://{}/{}".format(bucket_name, blob.name))
          for blob in blobs if blob.name.endswith(".csv")]
all_dat = pd.concat(frames, ignore_index=True)
One simple solution that I found was:
files = !gsutil ls -r gs://custom_jobs/python_test/
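In Colab, the ! magic captures the command's stdout as a list of strings, so the files variable above can be filtered and fed to pandas directly. A minimal sketch building on that line (the .csv suffix check is an assumption, meant to skip the directory-header and blank lines that gsutil ls -r also prints):

import pandas as pd

# `files` holds the lines captured from the gsutil listing above;
# keep only the CSV object paths and concatenate them into one dataframe
csv_paths = [f.strip() for f in files if f.strip().endswith('.csv')]
all_dat = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)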