
Improving reading BQ data from inside of a transformation in Apache Beam - Python

Original post: 2022-01-13 13:15:34 4 1 python/ google-bigquery/ google-cloud-dataflow/ apache-beam

I have an Apache Beam pipeline that I submit to Google Cloud Dataflow. My use case is as follows:

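I have a BQ table A, which I read into a PCollection using the native Beam IO connector.
I pass the PCollection to a transformation in which I need to check, for each element, whether it exists in BQ table B. I could not find a way to run that query with native Apache Beam, which is why I use Google's BQ Python library.
After that, I save the results I want to BQ table C using the Beam IO connector.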
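The question does not include the actual code, so the following is only a minimal sketch of what the three steps above might look like. The project, dataset, and table names, the `id` key column (assumed to be a STRING), and the output schema are all hypothetical. Beam and the BigQuery client are imported inside `run()` so the query helper can be used on its own.

```python
# Hypothetical names: replace with real project, dataset, and tables.
TABLE_A = "my-project:my_dataset.table_a"   # Beam table spec (input)
TABLE_B = "my-project.my_dataset.table_b"   # SQL table name (lookup)
TABLE_C = "my-project:my_dataset.table_c"   # Beam table spec (output)


def exists_query(table_b: str) -> str:
    """Parameterized query: returns one row if @id is present in table B."""
    return f"SELECT 1 FROM `{table_b}` WHERE id = @id LIMIT 1"


def run(argv=None):
    # Local imports: the module stays importable without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud import bigquery

    class CheckInTableB(beam.DoFn):
        def setup(self):
            # Create the client once per DoFn instance, not once per element.
            self.client = bigquery.Client()

        def process(self, row):
            job_config = bigquery.QueryJobConfig(
                query_parameters=[
                    bigquery.ScalarQueryParameter("id", "STRING", row["id"])
                ]
            )
            # One BigQuery query job per element: this is the costly step.
            job = self.client.query(exists_query(TABLE_B), job_config=job_config)
            found = bool(list(job.result()))
            yield {"id": row["id"], "in_table_b": found}

    # Reading a table by export needs --temp_location (or gcs_location) in argv.
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            | "ReadTableA" >> beam.io.ReadFromBigQuery(table=TABLE_A)
            | "CheckTableB" >> beam.ParDo(CheckInTableB())
            | "WriteTableC" >> beam.io.WriteToBigQuery(
                TABLE_C,
                schema="id:STRING,in_table_b:BOOLEAN",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```

Even with the client created once in `setup`, every element still triggers its own query job against table B.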

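My approach works well, but it is very costly. For example, for a table A with 150k records it takes at least 15 days of processing time, which with 475 workers on Google Dataflow is compressed into 2 hours. I know the main reason for the high cost and the long execution time is the SQL query I submit for each element from inside the transformation, because each query takes time.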
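To illustrate why one query per element dominates the cost, here is a toy in-memory model. It makes no BigQuery calls; `CountingTableB` and both lookup functions are invented for this illustration, and it only counts how many round-trips each strategy would make.

```python
class CountingTableB:
    """In-memory stand-in for table B that counts simulated queries."""

    def __init__(self, ids):
        self._ids = set(ids)
        self.queries = 0

    def exists(self, key):
        # Simulates one query job per call.
        self.queries += 1
        return key in self._ids

    def all_ids(self):
        # Simulates a single bulk read of the key column.
        self.queries += 1
        return set(self._ids)


def per_element_lookup(table_a, table_b):
    """One simulated query for every element of table A."""
    return [(key, table_b.exists(key)) for key in table_a]


def bulk_lookup(table_a, table_b):
    """One simulated query in total, then local membership checks."""
    b_ids = table_b.all_ids()
    return [(key, key in b_ids) for key in table_a]


if __name__ == "__main__":
    table_a = range(150_000)
    slow = CountingTableB(range(0, 150_000, 2))
    fast = CountingTableB(range(0, 150_000, 2))
    assert per_element_lookup(table_a, slow) == bulk_lookup(table_a, fast)
    print("per-element queries:", slow.queries)
    print("bulk queries:", fast.queries)
```

Both strategies return the same answers; the first issues as many queries as table A has records, the second issues one.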

Have any of you faced this kind of problem before? Or do you know of an improvement I could make to my code to reduce the cost?

1 Answer

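A simple and elegant workaround (I don't know how I didn't think of it sooner).

Instead of passing table A directly to the Beam job for processing, I created a new table A2, which is the result of a left join between table A and table B.

As a result, the data I used to request from inside the workers through SQL queries is already present in the job's input data (table A2).

This saved a lot of compute resources.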
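The answer gives no code, so this is a minimal sketch of the pre-join idea under stated assumptions: the project, dataset, and table names, the `id` join key, the derived `in_table_b` column, and the output schema are all hypothetical. The join runs once inside BigQuery, and the Beam step that used to issue a query becomes a pure function.

```python
PROJECT = "my-project"   # hypothetical
DATASET = "my_dataset"   # hypothetical


def sql_name(table: str) -> str:
    """Table name as used inside a standard SQL statement."""
    return f"{PROJECT}.{DATASET}.{table}"


def beam_name(table: str) -> str:
    """Table spec as accepted by the Beam BigQuery connectors."""
    return f"{PROJECT}:{DATASET}.{table}"


def build_a2_query(table_a: str, table_b: str, table_a2: str, key: str = "id") -> str:
    """SQL that materializes A2 = A LEFT JOIN B with an existence flag."""
    return (
        f"CREATE OR REPLACE TABLE `{table_a2}` AS\n"
        f"SELECT a.*, b.{key} IS NOT NULL AS in_table_b\n"
        f"FROM `{table_a}` AS a\n"
        # DISTINCT keeps one row per key so rows of A are not duplicated.
        f"LEFT JOIN (SELECT DISTINCT {key} FROM `{table_b}`) AS b\n"
        f"ON a.{key} = b.{key}"
    )


def to_output_row(row: dict) -> dict:
    """Per-element step: no BigQuery call, the flag is already in the row."""
    return {"id": row["id"], "in_table_b": bool(row["in_table_b"])}


def run(argv=None):
    # Local imports: the helpers above work without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud import bigquery

    # Step 1: build table A2 once, inside BigQuery.
    sql = build_a2_query(sql_name("table_a"), sql_name("table_b"), sql_name("table_a2"))
    bigquery.Client().query(sql).result()

    # Step 2: the Beam job reads A2 and never queries table B itself.
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            | "ReadTableA2" >> beam.io.ReadFromBigQuery(table=beam_name("table_a2"))
            | "UseJoinedFlag" >> beam.Map(to_output_row)
            | "WriteTableC" >> beam.io.WriteToBigQuery(
                beam_name("table_c"),
                schema="id:STRING,in_table_b:BOOLEAN",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```

The `DISTINCT` subquery is a design choice for this sketch: without it, a key that appears several times in table B would duplicate the matching rows of table A in A2.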

Related questions

Python Apache Beam Datapiple to read multiple BQ tables

Reading data from MySQL and writing in GCP Bucket using Apache Beam Python API

Apache beam: Reading and transforming multiple data types from single file

how to rename the source columns to target column names while reading the data from BIG QUERY into PCollection in apache beam using python SDK

Python apache beam remove elements from data set

Creating custom source for reading from cloud datastore using latest python apache_beam cloud datafow sdk

apache-beam reading multiple files from multiple folders of GCS buckets and load it biquery python

Writing Apache Beam Tagged Output (Dataflow runner) to different BQ tables

NotImplementedError apache beam python

apache beam python and airflow reading from GCS results in TypeError("__init__() got an unexpected keyword argument 'response_encoding'"
