In Apache Beam/Dataflow's WriteToBigQuery transform, how do you enable the deadletter pattern with Method.FILE_LOADS and Avro temp_file_format?
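In this document, Apache Beam suggests the dead-letter pattern when writing to BigQuery. This pattern lets you fetch the rows that failed to be written from the transform output tagged 'FailedRows'.
However, when I try to use it: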
WriteToBigQuery(
    table=self.bigquery_table_name,
    schema={"fields": self.bigquery_table_schema},
    method=WriteToBigQuery.Method.FILE_LOADS,
    temp_file_format=FileFormat.AVRO,
)
A schema mismatch in one of my elements causes the following exception:
Error message from worker: Traceback (most recent call last):
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1630, in write
    self._avro_writer.write(row)
  File "fastavro/_write.pyx", line 647, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 376, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 320, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 374, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 283, in fastavro._write.write_union
ValueError: [] (type <class 'list'>) do not match ['null', 'double'] on field safety_proxy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 718, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 841, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1334, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/my_code/apache_beam/io/gcp/bigquery_file_loads.py", line 258, in process
    writer.write(row)
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1635, in write
    ex, self._avro_writer.schema, row)).with_traceback(tb)
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1630, in write
    self._avro_writer.write(row)
  File "fastavro/_write.pyx", line 647, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 376, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 320, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 374, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 283, in fastavro._write.write_union
ValueError: Error writing row to Avro: [] (type <class 'list'>) do not match ['null', 'double'] on field safety_proxy Schema: ...
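From what I understand, the schema mismatch causes fastavro._write.Writer.write to fail and raise an exception. Instead, I would like WriteToBigQuery to apply the dead-letter behavior and return my malformed rows in the output tagged FailedRows. Is there a way to achieve this?
Thanks
EDIT: Adding a more detailed example of what I'm trying to do: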
from apache_beam import Create
from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery
from apache_beam.io.gcp.bigquery_tools import FileFormat
from apache_beam.io.textio import WriteToText
...
valid_rows = [{"some_field_name": i} for i in range(1000000)]
invalid_rows = [{"wrong_field_name": 0}]
# `pipeline` is assumed to be the surrounding beam.Pipeline object.
pcoll = pipeline | Create(valid_rows + invalid_rows)
# This fails because of the 1 invalid row
write_result = (
    pcoll
    | WriteToBigQuery(
        table=self.bigquery_table_name,
        schema={
            "fields": [
                {'name': 'some_field_name', 'type': 'INTEGER', 'mode': 'NULLABLE'},
            ]
        },
        method=WriteToBigQuery.Method.FILE_LOADS,
        temp_file_format=FileFormat.AVRO,
    )
)
# What I want is for WriteToBigQuery to partially succeed and output the failed rows.
# This is because I have pipelines that run for multiple hours and fail because of
# a small amount of malformed rows
(
    write_result[BigQueryWriteFn.FAILED_ROWS]
    | WriteToText('gs://my_failed_rows/')
)
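Let's step back slightly and look at the desired goals and outcomes.
Why is "FILE_LOADS" required as the BigQuery write method?
Are you also aware of the BigQuery Storage Write API: https://cloud.google.com/bigquery/docs/write-api
At the time of writing, it looks like the Java SDK supports the BQ Write API, but the Python SDK does not. I believe the Write API writes to BigQuery over a gRPC connection, rather than serializing to Avro and then calling the [legacy] batch load process?
Perhaps take a look and see whether that helps. Schemas are important, but Avro seems irrelevant to your goal and is only involved because of the code you are calling?
I am only used to thinking of the "dead letter" pattern in relation to streaming. That is probably just a terminology nitpick, since your message is very clear about your intent.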
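You can use a dead letter queue in the pipeline instead of letting BigQuery catch the errors for you. Beam proposes a native way to do error handling and dead letter queues with TupleTags, but the code is a little verbose.
I created an open source library called Asgarde, for the Python SDK and the Java SDK, that applies error handling with less code, in a more concise and expressive way:
https://github.com/tosun-si/pasgarde
(There is also a Java version: https://github.com/tosun-si/asgarde )
You can install it with pip: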
asgarde==0.16.0
pip install asgarde==0.16.0
from typing import Dict

import apache_beam as beam
from apache_beam import Create
from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery
from apache_beam.io.textio import WriteToText
from asgarde.collection_composer import CollectionComposer

def validate_row(self, row) -> Dict:
    field = row['your_field']
    if field is None or field == '':
        # You can raise your own custom exception
        raise ValueError('Bad field')
    # Return the row unchanged so valid elements continue down the pipeline
    return row
...
valid_rows = [{"some_field_name": i} for i in range(1000000)]
invalid_rows = [{"wrong_field_name": 0}]
# `pipeline` is assumed to be the surrounding beam.Pipeline object.
pcoll = pipeline | Create(valid_rows + invalid_rows)
# Dead letter queue proposed by Asgarde: it returns the output and Failure PCollections.
output_pcoll, failure_pcoll = (CollectionComposer.of(pcoll)
                               .map(self.validate_row))
# Good sink
(
    output_pcoll
    | WriteToBigQuery(
        table=self.bigquery_table_name,
        schema={
            "fields": [
                {'name': 'some_field_name', 'type': 'INTEGER', 'mode': 'NULLABLE'},
            ]
        },
        method=WriteToBigQuery.Method.FILE_LOADS
    )
)
# Bad sink: PCollection[Failure] / Failure contains inputElement and
# stackTrace.
(
    failure_pcoll
    | beam.Map(lambda failure: self.your_failure_transformation(failure))
    | WriteToBigQuery(
        table=self.bigquery_table_name,
        schema=your_schema_for_failure_table,
        method=WriteToBigQuery.Method.FILE_LOADS
    )
)
The structure of the Failure object proposed by the Asgarde lib:
from dataclasses import dataclass

@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception
In the validate_row function, you apply your validation logic and detect bad fields. When a field is bad you raise an exception, and Asgarde catches the error for you.
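The validation can also be driven by the BigQuery schema itself, so that a row like the one in the question (an empty list for a nullable double) is rejected before it reaches the Avro writer. The following is only a sketch under assumptions: the type table `_ACCEPTED_TYPES` is deliberately incomplete and hypothetical, and this `validate_row` takes the schema fields as an explicit second argument, which differs from the method shown above.

```python
from typing import Any, Dict, List

# Python types accepted for a few BigQuery column types.
# Assumption: this table is illustrative and far from complete.
_ACCEPTED_TYPES = {
    "INTEGER": (int,),
    "INT64": (int,),
    "FLOAT": (int, float),
    "FLOAT64": (int, float),
    "STRING": (str,),
    "BOOLEAN": (bool,),
}


def validate_row(row: Dict[str, Any], fields: List[Dict[str, str]]) -> Dict[str, Any]:
    """Return the row unchanged, or raise ValueError if it does not fit the schema."""
    unknown = set(row) - {f["name"] for f in fields}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    for field in fields:
        name = field["name"]
        mode = field.get("mode", "NULLABLE")
        value = row.get(name)

        if value is None:
            if mode == "REQUIRED":
                raise ValueError(f"missing required field: {name}")
            continue

        if mode == "REPEATED":
            if not isinstance(value, list):
                raise ValueError(f"{name}: expected a list, got {type(value).__name__}")
            items = value
        else:
            items = [value]

        accepted = _ACCEPTED_TYPES.get(field["type"])
        if accepted is None:
            continue  # column type not covered by this sketch

        for item in items:
            # bool is a subclass of int, so reject it explicitly for numeric columns
            is_stray_bool = isinstance(item, bool) and bool not in accepted
            if is_stray_bool or not isinstance(item, accepted):
                raise ValueError(f"{name}: {item!r} does not match {field['type']}")

    return row
```

With a schema field `{"name": "safety_proxy", "type": "FLOAT", "mode": "NULLABLE"}`, the row `{"safety_proxy": []}` raises ValueError here, so it can be routed to the failure output instead of crashing the load step.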
The result of the CollectionComposer flow is a pair: the output PCollection (in this case, I think, a PCollection[Dict]) and a PCollection[Failure].
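To make the shape of that result concrete, here is a conceptual sketch in plain Python, with lists standing in for PCollections. It is not Asgarde's implementation: `map_with_failures` is a hypothetical helper written for illustration, and only the Failure fields are taken from the structure shown above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception


def map_with_failures(
    step: str, fn: Callable[[Any], Any], elements: Iterable[Any]
) -> Tuple[List[Any], List[Failure]]:
    """Apply fn to every element; collect results and failures separately."""
    outputs: List[Any] = []
    failures: List[Failure] = []
    for element in elements:
        try:
            outputs.append(fn(element))
        except Exception as exc:  # any error becomes a Failure instead of aborting
            failures.append(Failure(step, str(element), exc))
    return outputs, failures


def validate_row(row):
    field = row["your_field"]
    if field is None or field == "":
        raise ValueError("Bad field")
    return row


outputs, failures = map_with_failures(
    "Validate rows",
    validate_row,
    [{"your_field": "ok"}, {"your_field": ""}, {"wrong_field_name": 0}],
)
```

Here the first row ends up in `outputs`, while the empty field (ValueError) and the missing key (KeyError) each produce a Failure carrying the step name, the stringified input, and the exception.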
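Finally, you can send each PCollection to its own sink, as with the good sink and bad sink in the code above.
You can also apply the same logic with native Beam error handling and TupleTags. I have proposed an example of this in a project in my GitHub repository.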
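For reference, a minimal sketch of the native approach in the Python SDK, where tagged outputs play the role of TupleTags. The names `FAILED_ROWS_TAG`, `to_failure_record` and `split_valid_and_failed` are hypothetical and chosen for this example; the Beam calls used are `beam.FlatMap`, `with_outputs` and `beam.pvalue.TaggedOutput`. Treat it as an illustration rather than the code from the repository mentioned above.

```python
from typing import Any, Callable, Dict

FAILED_ROWS_TAG = "failed_rows"  # hypothetical tag name chosen for this sketch


def to_failure_record(row: Any, exc: Exception) -> Dict[str, str]:
    """Turn a rejected element and its exception into a row for a dead-letter sink."""
    return {
        "input_element": repr(row),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
    }


def split_valid_and_failed(pcoll, validate: Callable[[Any], Any]):
    """Return (valid, failed) PCollections using Beam tagged outputs."""
    # Imported here so the helper above stays usable without Beam installed.
    import apache_beam as beam

    def route(row):
        try:
            # Main output: the validated row
            yield validate(row)
        except Exception as exc:  # deliberately broad: every failure is dead-lettered
            yield beam.pvalue.TaggedOutput(
                FAILED_ROWS_TAG, to_failure_record(row, exc)
            )

    results = pcoll | "ValidateRows" >> beam.FlatMap(route).with_outputs(
        FAILED_ROWS_TAG, main="valid"
    )
    return results.valid, results[FAILED_ROWS_TAG]
```

The first returned PCollection can then go to WriteToBigQuery and the second to WriteToText or a failure table, so a handful of malformed rows no longer fails a pipeline that has been running for hours.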