
How to read data from BigQuery and the file system using an Apache Beam Python job in the same pipeline?

I am trying to read some data from BigQuery and some data from the file system using the code below.

    apn = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()
    preprocess_rows = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())

But when I run this pipeline, I get the error below:

    Traceback (most recent call last):
      File "/etl/dataflow/etlTXLPreprocessor.py", line 125, in <module>
        run()
      File "/etl/dataflow/etlTXLPreprocessor.py", line 120, in run
        p.run().wait_until_finish()
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
        self._options).run(False)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
        return self.runner.run_pipeline(self, self._options)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
        return runner.run_pipeline(pipeline, options)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 413, in run_pipeline
        pipeline.replace_all(_get_transform_overrides(options))
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 443, in replace_all
        self._replace(override)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 340, in _replace
        self.visit(TransformUpdater(self))
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 503, in visit
        self._root_transform().visit(visitor, self, visited)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
        part.visit(visitor, pipeline, visited)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
        part.visit(visitor, pipeline, visited)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
        part.visit(visitor, pipeline, visited)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 942, in visit
        visitor.visit_transform(self)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 338, in visit_transform
        self._replace_if_needed(transform_node)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/pipeline.py", line 301, in _replace_if_needed
        new_output = replacement_transform.expand(input_node)
      File "/etl/dataflow/venv3/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 87, in expand
        invoker = DoFnInvoker.create_invoker(signature, process_invocation=False)
      File "apache_beam/runners/common.py", line 360, in apache_beam.runners.common.DoFnInvoker.create_invoker
    TypeError: create_invoker() takes at least 2 positional arguments (1 given)

But if I run my code like this:

    apn = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()
    apn1 = p | beam.io.Read(beam.io.BigQuerySource(query=apn_query, use_standard_sql=True)) | beam.combiners.ToList()

or like this:

    preprocess_rows = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())
    preprocess_rows1 = p | beam.io.ReadFromText(file_path, coder=UnicodeCoder())

it runs fine. I am unable to figure out the error. Is there a limitation on reading from more than one kind of data source in an Apache Beam pipeline?

I am getting the same error when performing the same type of action: pulling in data from BigQuery and the file system.

lines = p | "Read Input Parameters" >> ReadFromText(options.input)
past_posts = p | "Get Past Posts From BigQuery" >> Read(BigQuerySource(query=f"SELECT url FROM {full_bq_table_id}", use_standard_sql=False))

Error:

    Traceback (most recent call last):
      File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/craigslist_pipeline.py", line 14, in <module>
        full_bq_table_id=f"apartment-data-project:{dataset}.craigslist_posts"
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/pipeline/__init__.py", line 35, in run
        result = p.run()
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 461, in run
        self._options).run(False)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 474, in run
        return self.runner.run_pipeline(self, self._options)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 182, in run_pipeline
        return runner.run_pipeline(pipeline, options)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 413, in run_pipeline
        pipeline.replace_all(_get_transform_overrides(options))
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 443, in replace_all
        self._replace(override)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 340, in _replace
        self.visit(TransformUpdater(self))
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 503, in visit
        self._root_transform().visit(visitor, self, visited)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
        part.visit(visitor, pipeline, visited)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
        part.visit(visitor, pipeline, visited)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 939, in visit
        part.visit(visitor, pipeline, visited)
      [Previous line repeated 1 more time]
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 942, in visit
        visitor.visit_transform(self)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 338, in visit_transform
        self._replace_if_needed(transform_node)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/pipeline.py", line 301, in _replace_if_needed
        new_output = replacement_transform.expand(input_node)
      File "/Users/ianmitchell/Documents/Personal Projects/Craigslist/env/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 87, in expand
        invoker = DoFnInvoker.create_invoker(signature, process_invocation=False)
      File "apache_beam/runners/common.py", line 360, in apache_beam.runners.common.DoFnInvoker.create_invoker
    TypeError: create_invoker() takes at least 2 positional arguments (1 given)

I am also wondering why you cannot pull from different sources in the same pipeline.

This is a bug in the direct runner in Apache Beam v2.19. The fix has been made but not released yet. Downgrade your apache-beam to 2.16 (pip install apache-beam==2.16) and it will work.

