Read csv files recursively in all sub folders from a GCP bucket using python
I am trying to load all CSV files recursively from every subfolder of a GCP bucket using Python and pandas. Currently I am using Dask to load the data, but it is very slow.
import dask.dataframe

path = "gs://mybucket/parent_path/" + "*/*.csv"
getAllDaysData = dask.dataframe.read_csv(path).compute()
Can someone help me with a better way?
I would suggest converting the data to Parquet files instead, and then using

pd.read_parquet(file, engine='pyarrow')

to load them into a pandas dataframe.
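As a rough sketch of that workflow (the bucket paths below are placeholders, and it assumes gcsfs and pyarrow are installed for gs:// access), you could do a one-time conversion with Dask and then read the Parquet data with pandas:

import dask.dataframe as dd
import pandas as pd

# One-time conversion: read all CSVs from the subfolders and write them out as Parquet
ddf = dd.read_csv("gs://mybucket/parent_path/*/*.csv")
ddf.to_parquet("gs://mybucket/parent_path_parquet/", engine="pyarrow")

# Subsequent loads read the columnar Parquet data, which is much faster than parsing CSV
df = pd.read_parquet("gs://mybucket/parent_path_parquet/", engine="pyarrow")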
Alternatively you might want to consider loading the data into BigQuery first. You can do something like this, as long as all the CSV files have the same structure.
from google.cloud import bigquery

client = bigquery.Client()

# A single * wildcard matches objects in nested "folders" too, since GCS object names are flat
uri = "gs://mybucket/parent_path/*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV
)

load_job = client.load_table_from_uri(
    uri,
    'destination_table',  # fully-qualified table id, e.g. 'project.dataset.table'
    job_config=job_config,
    location=GCP_LOCATION
)
load_job_result = load_job.result()
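If the end goal is still a pandas dataframe, you can then pull the loaded table back out of BigQuery. A minimal sketch (the table name below is a placeholder, and it assumes pyarrow/db-dtypes are installed for the conversion):

# Query the freshly loaded table and materialize the result as a pandas dataframe
query = "SELECT * FROM `project.dataset.destination_table`"
df = client.query(query).to_dataframe()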