Issue with reading millions of files from Cloud Storage using Dataflow in Google Cloud

Original post: 2022-07-22 21:14:19
Tags: google-cloud-platform, google-cloud-storage, google-cloud-dataflow, apache-beam, google-cloud-pubsub

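Scenario: I am trying to read files and send their data to Pub/Sub.

Millions of files are stored in a Cloud Storage folder (GCP).
I created a Dataflow pipeline from the Pub/Sub topic using the "Text Files on Cloud Storage to Pub/Sub" template.
However, the template cannot read millions of files and fails with the following error:
java.lang.IllegalArgumentException: Total size of the BoundedSource objects generated by split() operation is larger than the allowable limit. When splitting gs://filelocation/data/*.json into bundles of 28401539859 bytes it generated 2397802 BoundedSource objects with total serialized size of 199603686 bytes which is larger than the limit 20971520.

System configuration: Apache Beam 2.38 (Java SDK); machine type: high-performance n1-highmem-16

Any ideas on how to solve this? Thanks in advance.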

2 Answers

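According to this document (1), you can work around the problem by modifying your custom BoundedSource subclass so that the total size of the generated BoundedSource objects stays below the 20 MB limit.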

(1) https://cloud.google.com/dataflow/docs/guides/common-errors#boundedsource-objects-splitintobundles
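
To see how far the job in the question is from that limit, the figures in the error message can be checked with a few lines of plain Java. This is only an illustrative sketch: the class and method names are made up, and the three constants are copied from the error text. It suggests that each generated source serializes to well under 100 bytes, so the problem is the sheer number of sources (presumably about one per matched file), not the size of any single one.

```java
public class SplitSizeCheck {
    // Values copied from the IllegalArgumentException in the question.
    static final long SOURCE_COUNT = 2_397_802L;
    static final long TOTAL_SERIALIZED_BYTES = 199_603_686L;
    static final long LIMIT_BYTES = 20_971_520L; // 20 MiB

    // Average serialized size of one BoundedSource, rounded down.
    static long averageBytesPerSource() {
        return TOTAL_SERIALIZED_BYTES / SOURCE_COUNT;
    }

    // True when the serialized split result is over the allowed limit.
    static boolean exceedsLimit() {
        return TOTAL_SERIALIZED_BYTES > LIMIT_BYTES;
    }

    // How many times larger than the limit the split result is.
    static double overshootFactor() {
        return (double) TOTAL_SERIALIZED_BYTES / LIMIT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("Average bytes per source: " + averageBytesPerSource());
        System.out.println("Exceeds limit: " + exceedsLimit());
        System.out.printf("Overshoot factor: %.2f%n", overshootFactor());
    }
}
```

With these numbers, the file pattern would have to match roughly a tenth as many files per read for the split result to fit, which is why reducing the number of generated sources, or avoiding the initial split entirely, is the practical fix.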

You can also use TextIO.readAll() to avoid this limit.
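
A minimal sketch of that approach with the Beam Java SDK follows. It is not the template's actual code. As far as I know, TextIO.readAll() is marked deprecated in recent Beam releases in favor of FileIO.matchAll() followed by FileIO.readMatches() and TextIO.readFiles(), so the sketch uses that equivalent chain. The file pattern is taken from the error message; the class name and the Pub/Sub topic path are hypothetical placeholders, and the Beam Google Cloud Platform IO module is assumed to be on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class GcsTextToPubsub {
    public static void main(String[] args) {
        // Runner, project, region, etc. are expected on the command line.
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // The pattern is passed as data, so it is expanded while the job
            // runs rather than during the initial source split.
            .apply("FilePattern", Create.of("gs://filelocation/data/*.json"))
            .apply("MatchFiles", FileIO.matchAll())
            .apply("OpenFiles", FileIO.readMatches())
            .apply("ReadLines", TextIO.readFiles())
            // Hypothetical topic path; replace with your own.
            .apply("WriteToPubsub",
                PubsubIO.writeStrings().to("projects/my-project/topics/my-topic"));

        pipeline.run();
    }
}
```

Because the file pattern travels through the pipeline as an ordinary element, matching and reading happen inside regular transforms, and the job should not need to serialize millions of BoundedSource objects up front.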

Related questions

Reading files from google cloud storage bucket

Once dataflow job is created in Google Cloud using apache beam sdk, can we delete the tmp files from cloud storage bucket?

Moving Files from Azure blob storage to Google cloud storage bucket

Facing issue while uploading to Google cloud storage using signed URL

Allow users to download single & bulk files from google cloud storage

Upload and retrieve files from Google Cloud Storage/ Buckets in Python/Django

python delete files from google cloud storage that starts with

Download multiple files from GOOGLE CLOUD STORAGE BUCKET

Parallel upload of files from a directory in Google Cloud Storage

The "Parquet Files on Cloud Storage to Cloud Bigtable" DataFlow template cannot read parquet files


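Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site's URL or the original post's address. For any questions, contact: yoyou2525@163.com.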
