Read csv files recursively in all sub folders from a GCP bucket using python
I am trying to load all CSV files recursively from every subfolder of a GCP bucket using Python and pandas. Currently I am using Dask to load the data, but it is very slow.
import dask.dataframe

path = "gs://mybucket/parent_path/" + "*/*.csv"
getAllDaysData = dask.dataframe.read_csv(path).compute()
Can someone help me with a better way?
I would suggest converting the data to Parquet files instead, and then using

pd.read_parquet(file, engine='pyarrow')

to load them into a pandas dataframe.
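As a rough sketch of that workflow (the bucket paths below are placeholders, and it assumes gcsfs and pyarrow are installed for gs:// access), you could do a one-time conversion with Dask and then read the Parquet data with pandas:

import dask.dataframe as dd
import pandas as pd

# One-time conversion: read all CSVs from the subfolders and write them out as Parquet
ddf = dd.read_csv("gs://mybucket/parent_path/*/*.csv")
ddf.to_parquet("gs://mybucket/parent_path_parquet/", engine="pyarrow")

# Subsequent loads read the columnar Parquet data, which is much faster than parsing CSV
df = pd.read_parquet("gs://mybucket/parent_path_parquet/", engine="pyarrow")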
Alternatively you might want to consider loading the data into BigQuery first. You can do something like this, as long as all the CSV files have the same structure.
from google.cloud import bigquery

client = bigquery.Client()

# A single * wildcard matches objects in nested "folders" too, since GCS object names are flat
uri = "gs://mybucket/parent_path/*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV
)

load_job = client.load_table_from_uri(
    uri,
    'destination_table',  # fully-qualified table id, e.g. 'project.dataset.table'
    job_config=job_config,
    location=GCP_LOCATION
)
load_job_result = load_job.result()
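If the end goal is still a pandas dataframe, you can then pull the loaded table back out of BigQuery. A minimal sketch (the table name below is a placeholder, and it assumes pyarrow/db-dtypes are installed for the conversion):

# Query the freshly loaded table and materialize the result as a pandas dataframe
query = "SELECT * FROM `project.dataset.destination_table`"
df = client.query(query).to_dataframe()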