
Why does my Python Dataflow job get stuck at the Write phase?

I wrote a Python Dataflow job that managed to process 300 files; unfortunately, when I try to run it on 400 files, it gets stuck at the Write phase forever.

The logs aren't really helpful, but I think the issue lies in the write logic of the code. Initially, I only wanted 1 output file, so I wrote:

     | 'Write' >> beam.io.WriteToText(
                known_args.output,
                file_name_suffix=".json",
                num_shards=1,
                shard_name_template=""
            ))

Then, I removed num_shards=1 and shard_name_template="" and I was able to process more files, but it still gets stuck (the modified write step is sketched below).
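For reference, here is a minimal sketch of the write step after that change, assuming the rest of the pipeline is unchanged (known_args.output comes from the original snippet; without num_shards, Beam picks the shard count itself):

     | 'Write' >> beam.io.WriteToText(
                known_args.output,
                file_name_suffix=".json"
            ))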

Additional information

  • The files to process are small, less than 1 MB each
  • Also, when I removed the num_shards and shard_name_template fields, I noticed that the data was written to a temp folder in the destination output path, but the job never finishes
  • I got the following DEADLINE_EXCEEDED exception, and I tried to solve it by increasing --num_workers to 6 and --disk_size_gb to 30, but it didn't work:
Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 638, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 63, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
  File "dataflow_worker/shuffle_operations.py", line 64, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
  File "dataflow_worker/shuffle_operations.py", line 79, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
  File "dataflow_worker/shuffle_operations.py", line 80, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
  File "dataflow_worker/shuffle_operations.py", line 82, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 441, in __iter__
    for entry in entries_iterator:
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 282, in __next__
    return next(self.iterator)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/shuffle.py", line 240, in __iter__
    chunk, next_position = self.reader.Read(start_position, end_position)
  File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 133, in shuffle_client.PyShuffleReader.Read
OSError: Shuffle read failed: b'DEADLINE_EXCEEDED: (g)RPC timed out when extract-fields-three-mont-10090801-dlaj-harness-fj4v talking to extract-fields-three-mont-10090801-dlaj-harness-1f7r:12346. Server unresponsive (ping error: Deadline Exceeded, {"created":"@1602260204.931126454","description":"Deadline Exceeded","file":"third_party/grpc/src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}). Typically one can self manage this issue, please read: https://cloud.google.com/dataflow/docs/guides/common-errors#tsg-rpc-timeout'

Could you recommend some approaches to troubleshoot this kind of problem?

After trying to pump up the resources, I managed to solve the issue by enabling the Dataflow Shuffle service. Please see this resource.

Just add --experiments=shuffle_mode=service to your PipelineOptions.
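As a minimal sketch of where that flag goes (the project, region, and bucket values below are hypothetical placeholders, not from the original post):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # --experiments=shuffle_mode=service enables the service-based Dataflow
    # Shuffle; all other values here are placeholder assumptions.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',
        '--region=us-central1',
        '--temp_location=gs://my-bucket/temp',
        '--experiments=shuffle_mode=service',
    ])

    with beam.Pipeline(options=options) as pipeline:
        ...  # rest of the pipeline, ending in the WriteToText step above

The same flag can also be passed on the command line when launching the job, since PipelineOptions reads unparsed command-line arguments.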

