KeyError when passing a PCollection as a side input in Apache Beam

Question

I am passing the side_input PCollection as a side input to a ParDo transform, but that same ParDo fails with a KeyError.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import relational_db
from processors.appendcol import AppendCol
from side_inputs.config import sideinput_bq_config
from source.config import source_config


with beam.Pipeline(options=PipelineOptions()) as si:
  side_input = si | "Reading from BQ side input" >> relational_db.ReadFromDB(
    source_config=sideinput_bq_config,
    table_name='abc',
    query="SELECT * FROM abc"
  )

with beam.Pipeline(options=PipelineOptions()) as p:
  PCollection = p | "Reading records from database" >> relational_db.ReadFromDB(
    source_config=source_config,
    table_name='xyzzy',
    query="SELECT * FROM xyzzy",
 ) | beam.ParDo(
   AppendCol(), beam.pvalue.AsIter(side_input)
 )

Here is the error:

Traceback (most recent call last):
  File "athena/etl.py", line 40, in <module>
    extract()
  File "athena/etl.py", line 22, in extract
    PCollection = p | "Reading records from database" >> relational_db.ReadFromDB(
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/pipeline.py", line 555, in __exit__
    self.result = self.run()
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/pipeline.py", line 534, in run
    return self.runner.run_pipeline(self, self._options)
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/direct/direct_runner.py", line 119, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 175, in run_pipeline
    self._latest_run_result = self.run_via_runner_api(
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 186, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 329, in run_stages
    runner_execution_context = execution.FnApiRunnerExecutionContext(
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/execution.py", line 323, in __init__
    self._build_data_side_inputs_map(stages))
  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/execution.py", line 386, in _build_data_side_inputs_map
    producing_stage = producing_stages_by_pcoll[side_pc]
KeyError: 'ref_PCollection_PCollection_5'

I am reading the data from a PostgreSQL table, and each element of the PCollection is a dictionary.

Answer 1

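I think the problem is that you have two separate pipelines trying to work together. The side input is created in pipeline `si`, but the ParDo that consumes it belongs to pipeline `p`, so `p` has no record of it when it runs. You should apply all of the transforms as part of a single pipeline: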

with beam.Pipeline(options=PipelineOptions()) as p:
  side_input = p | "Reading from BQ side input" >> relational_db.ReadFromDB(
    source_config=sideinput_bq_config,
    table_name='abc',
    query="SELECT * FROM abc")

  my_pcoll = p | "Reading records from database" >> relational_db.ReadFromDB(
    source_config=source_config,
    table_name='xyzzy',
    query="SELECT * FROM xyzzy",
  ) | beam.ParDo(
    AppendCol(), beam.pvalue.AsIter(side_input))
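
To illustrate why the two-pipeline version ends in a KeyError, here is a toy model in plain Python that needs no Beam installation. It is only a sketch and not Beam's actual implementation: `ToyPipeline`, `apply`, and `run` are hypothetical names invented for this example. The one detail taken from the traceback is the dictionary lookup `producing_stages_by_pcoll[side_pc]`, which can only succeed if the side input was produced by a step of the pipeline being run.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline; it only knows its own steps."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # each entry: (label, output_id, side_input_ids)

    def apply(self, label, side_inputs=()):
        # Register a step and return an id for the collection it produces.
        output_id = f"{self.name}/{label}"
        self.steps.append((label, output_id, tuple(side_inputs)))
        return output_id

    def run(self):
        # Same shape as the lookup in the traceback: map every collection
        # produced in THIS pipeline to the step that produces it.
        producing_stages_by_pcoll = {out: label for label, out, _ in self.steps}
        wiring = {}
        for label, _, side_inputs in self.steps:
            # A side input from another pipeline is missing here -> KeyError.
            wiring[label] = [producing_stages_by_pcoll[pc] for pc in side_inputs]
        return wiring


def two_pipelines():
    # Mirrors the question: the side input lives in a different pipeline.
    si = ToyPipeline("si")
    side_input = si.apply("read_side_input")
    p = ToyPipeline("p")
    p.apply("read_main")
    p.apply("append_col", side_inputs=[side_input])
    return p.run()


def single_pipeline():
    # Mirrors the answer: both reads and the ParDo share one pipeline.
    p = ToyPipeline("p")
    side_input = p.apply("read_side_input")
    p.apply("read_main")
    p.apply("append_col", side_inputs=[side_input])
    return p.run()
```

Calling `two_pipelines()` raises a KeyError for `'si/read_side_input'`, while `single_pipeline()` resolves the side input of `append_col` to the `read_side_input` step.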

Answer 1 (score 3, accepted, 2020-09-25 22:41:10)
