
How to read data from BigQuery and the file system using an Apache Beam Python job in the same pipeline?

I am trying to read some data from BigQuery and some data from the file system using the code below.

 apn = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()
 preprocess_rows = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())

But when I run this pipeline, I get the following error:

Traceback (most recent call last):
  File "/etl/dataflow/etlTXLPreprocessor.py", line 125, in <module>
    run()
  File "/etl/dataflow/etlTXLPreprocessor.py", line 120, in run
    p.run().wait_until_finish()
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
    self._options).run(False)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
    return self.runner.run_pipeline(self, self._options)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 413, in run_pipeline
    pipeline.replace_all(_get_transform_overrides(options))
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 443, in replace_all
    self._replace(override)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 340, in _replace
    self.visit(TransformUpdater(self))
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 503, in visit
    self._root_transform().visit(visitor, self, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 942, in visit
    visitor.visit_transform(self)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 338, in visit_transform
    self._replace_if_needed(transform_node)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 301, in _replace_if_needed
    new_output = replacement_transform.expand(input_node)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 87, in expand
    invoker = DoFnInvoker.create_invoker(signature, process_invocation=False)
  File "apache_beam/runners/common.py", line 360, in apache_beam.runners.common.DoFnInvoker.create_invoker
TypeError: create_invoker() takes at least 2 positional arguments (1 given)

However, the error does not occur if I read from the same type of source twice, like this:

apn = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()
apn1 = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()

or like this:

preprocess_rows = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())
preprocess_rows1 = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())

I can't figure out what causes the error. Is an Apache Beam pipeline limited to reading from a single type of data source?

I hit the same error when doing the same kind of thing, pulling data from both BigQuery and the file system:

lines = p | "Read Input Parameters" >> ReadFromText(options.input)
past_posts = p | "Get Past Posts From BigQuery" >> Read(BigQuerySource(query=f"SELECT url FROM {full_bq_table_id}", use_standard_sql=False))

Error:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/craigslist_pipeline.py", line 14, in <module>
    full_bq_table_id=f"apartment-data-project:{dataset}.craigslist_posts"
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/pipeline/__init__.py", line 35, in run
    result = p.run()
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
    self._options).run(False)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
    return self.runner.run_pipeline(self, self._options)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 413, in run_pipeline
    pipeline.replace_all(_get_transform_overrides(options))
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 443, in replace_all
    self._replace(override)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 340, in _replace
    self.visit(TransformUpdater(self))
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 503, in visit
    self._root_transform().visit(visitor, self, visited)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  [Previous line repeated 1 more time]
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 942, in visit
    visitor.visit_transform(self)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 338, in visit_transform
    self._replace_if_needed(transform_node)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 301, in _replace_if_needed
    new_output = replacement_transform.expand(input_node)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 87, in expand
    invoker = DoFnInvoker.create_invoker(signature, process_invocation=False)
  File "apache_beam/runners/common.py", line 360, in apache_beam.runners.common.DoFnInvoker.create_invoker
TypeError: create_invoker() takes at least 2 positional arguments (1 given)

I'd like to know why you can't pull from different sources.

This is a bug in the direct runner in Apache Beam v2.19. A fix has been made but is not yet released. Downgrade your apache-beam to 2.16 (pip install apache-beam==2.16) and it will work.
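
If you want to confirm which version you are on before downgrading, here is a minimal sketch of a check. It uses only the standard library (importlib.metadata, Python 3.8+). The helper names release_series, is_affected and installed_beam_version are made up for this example, and treating only the 2.19 series as affected is an assumption based on the answer above; other releases may or may not have the bug.

```python
from importlib import metadata

# Assumption from the answer above: only the 2.19 series is known to be affected.
AFFECTED_SERIES = (2, 19)


def release_series(version):
    """Return (major, minor) parsed from a version string such as '2.19.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_affected(version):
    """True if the version string belongs to the 2.19 series."""
    return release_series(version) == AFFECTED_SERIES


def installed_beam_version():
    """Return the installed apache-beam version, or None if it is not installed."""
    try:
        return metadata.version("apache-beam")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_beam_version()
    if version is None:
        print("apache-beam is not installed")
    elif is_affected(version):
        print(f"apache-beam {version} is in the 2.19 series; "
              "try: pip install apache-beam==2.16")
    else:
        print(f"apache-beam {version} is not in the 2.19 series")
```

Run it with the same interpreter (virtualenv) that runs the pipeline, otherwise it may report a different installation.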
