How to read data from BigQuery and the file system using an Apache Beam Python job in the same pipeline?
I am trying to read some data from BigQuery and some data from the file system using the code below.
apn = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()
preprocess_rows = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())
But when I run this pipeline, I get the following error:
Traceback (most recent call last):
  File "/etl/dataflow/etlTXLPreprocessor.py", line 125, in <module>
    run()
  File "/etl/dataflow/etlTXLPreprocessor.py", line 120, in run
    p.run().wait_until_finish()
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
    self._options).run(False)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
    return self.runner.run_pipeline(self, self._options)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 413, in run_pipeline
    pipeline.replace_all(_get_transform_overrides(options))
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 443, in replace_all
    self._replace(override)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 340, in _replace
    self.visit(TransformUpdater(self))
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 503, in visit
    self._root_transform().visit(visitor, self, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 942, in visit
    visitor.visit_transform(self)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 338, in visit_transform
    self._replace_if_needed(transform_node)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 301, in _replace_if_needed
    new_output = replacement_transform.expand(input_node)
  File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 87, in expand
    invoker = DoFnInvoker.create_invoker(signature, process_invocation=False)
  File "apache_beam/runners/common.py", line 360, in apache_beam.runners.common.DoFnInvoker.create_invoker
TypeError: create_invoker() takes at least 2 positional arguments (1 given)
But the error does not occur if I run my code like this:
apn = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()
apn1 = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()
or like this:
preprocess_rows = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())
preprocess_rows1 = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())
I am unable to figure out the error. Is it a limitation of Apache Beam that a pipeline can only read from the same data source?
I am getting the same error when doing the same thing: pulling in data from both BigQuery and the file system.
lines = p | "Read Input Parameters" >> ReadFromText(options.input)
past_posts = p | "Get Past Posts From BigQuery" >> Read(BigQuerySource(query=f"SELECT url FROM {full_bq_table_id}", use_standard_sql=False))
Error:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/craigslist_pipeline.py", line 14, in <module>
    full_bq_table_id=f"apartment-data-project:{dataset}.craigslist_posts"
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/pipeline/__init__.py", line 35, in run
    result = p.run()
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
    self._options).run(False)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
    return self.runner.run_pipeline(self, self._options)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 413, in run_pipeline
    pipeline.replace_all(_get_transform_overrides(options))
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 443, in replace_all
    self._replace(override)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 340, in _replace
    self.visit(TransformUpdater(self))
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 503, in visit
    self._root_transform().visit(visitor, self, visited)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
    part.visit(visitor, pipeline, visited)
  [Previous line repeated 1 more time]
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 942, in visit
    visitor.visit_transform(self)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 338, in visit_transform
    self._replace_if_needed(transform_node)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 301, in _replace_if_needed
    new_output = replacement_transform.expand(input_node)
  File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 87, in expand
    invoker = DoFnInvoker.create_invoker(signature, process_invocation=False)
  File "apache_beam/runners/common.py", line 360, in apache_beam.runners.common.DoFnInvoker.create_invoker
TypeError: create_invoker() takes at least 2 positional arguments (1 given)
I am also wondering why you cannot pull from different sources.
This is a bug in the direct runner in Apache Beam v2.19. A fix has been made but not yet released. Downgrade apache-beam to 2.16 (pip install apache-beam==2.16) and it will work.
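Before downgrading, it may help to confirm which version is actually installed in the environment that runs the pipeline. The following is only a minimal sketch using the standard library (Python 3.8+ for importlib.metadata); the helper names is_affected and installed_beam_version are made up for illustration, and the check covers only the 2.19 series named above, since nothing here says whether other releases are affected.

```python
from importlib import metadata  # standard library, Python 3.8+

# Assumption: only the 2.19 series is treated as affected, because that is
# the only release named in the answer above.
AFFECTED_SERIES = (2, 19)


def is_affected(version: str) -> bool:
    """Return True if a version string such as '2.19.0' is in the 2.19 series."""
    parts = version.split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        return False  # unparseable version string: make no claim
    return major_minor == AFFECTED_SERIES


def installed_beam_version():
    """Return the installed apache-beam version, or None if it is not installed."""
    try:
        return metadata.version("apache-beam")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_beam_version()
    if version is None:
        print("apache-beam is not installed in this environment")
    elif is_affected(version):
        print(f"apache-beam {version} is in the 2.19 series; "
              "consider: pip install apache-beam==2.16")
    else:
        print(f"apache-beam {version} is not in the 2.19 series")
```

Run it with the same interpreter (for example the same virtualenv) that launches the pipeline, otherwise it may report a different installation.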