
Incredibly slow read from GCS with GCP Dataflow & Apache Beam Python SDK

Let me preface everything by first stating I am just getting to know my way around working with Beam's Python SDK and GCP Dataflow!

The issue: My pipeline has been working great for almost all use cases. No errors that I can complain about. I just have some questions about some possible bottlenecks or optimizations I could make. I have noticed that my pipeline execution times jump to over 3 hours when working with a gzipped file of size ~50 MB. Not entirely sure if there is any way to speed up that part. Below is a screenshot of the log warnings I see a bunch of before the job eventually successfully completes.

Log output details from Dataflow

Here is a relevant snippet of the pipeline:

import argparse
import json

import apache_beam as beam

if __name__ == '__main__':

    parser = argparse.ArgumentParser(
        description='Beam Pipeline: example'
    )
    parser.add_argument('--input-glob',
                        help='Cloud Path To Raw Data',
                        required=True)
    parser.add_argument('--output-bq-table',
                        help='BQ Native {dataset}.{table}')
    known_args, pipeline_args = parser.parse_known_args()

    with beam.Pipeline(argv=pipeline_args) as p:

        # Read the (optionally gzipped) input and parse each line as JSON
        json_raw = (p | 'READING RAW DATA' >> beam.io.ReadFromText(known_args.input_glob)
            | 'JSON-ing' >> beam.Map(lambda e: json.loads(e))
        )

Extra info:

  1. I am using Airflow's DataflowHook to launch this.
  2. I've played around with different machine types, hoping that throwing more compute power at the problem would make it go away, but so far no luck.
  3. I am using the following pipeline execution params:

--runner=DataflowRunner \
--project=example-project \
--region=us-east4 \
--save_main_session \
--temp_location=gs://example-bucket/temp/ \
--input-glob=gs://example-bucket/raw/  \
--machine_type=n1-standard-16 \
--job_name=this-is-a-test-insert

Beam has numerous optimizations that allow it to split its processing of files across multiple workers. Furthermore, it is able to split a single file so that it can be consumed by multiple workers in parallel.

Unfortunately, this is not possible with gzip files. This is because gzip files are compressed into a single block that has to be decompressed serially. When the Beam worker reads this file, it has to read the whole thing serially.

There are some compression formats that allow you to read them in parallel (these are usually 'multiblock' formats). Unfortunately, I believe the Beam Python SDK only supports serial formats at the moment.

If you need to have your pipeline work this way, you could try adding a beam.Reshuffle operation after ReadFromText. By doing this, your pipeline will still read the file serially, but apply all downstream operations in parallel (see the Performance section in the PTransform guide to see why this is the case).
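As a rough illustration, here is a minimal sketch of that change, assuming the same --input-glob argument as in the question (the argument parsing is abbreviated):

import argparse
import json

import apache_beam as beam

parser = argparse.ArgumentParser()
parser.add_argument('--input-glob', required=True)
known_args, pipeline_args = parser.parse_known_args()

with beam.Pipeline(argv=pipeline_args) as p:
    json_raw = (
        p
        | 'READING RAW DATA' >> beam.io.ReadFromText(known_args.input_glob)
        # Reshuffle breaks fusion with the serial read, so parsing and any
        # later steps can be redistributed across all workers.
        | 'RESHUFFLE' >> beam.Reshuffle()
        | 'JSON-ing' >> beam.Map(json.loads)
    )

The read itself is still single-threaded, so this mainly helps when the downstream transforms dominate the running time.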

Some other alternatives:

  • Separate your data into multiple gzip files (a sketch of this follows the list below).
  • Decompress it before consuming it in the pipeline.
  • (half joking/half serious) Contribute multiblock compression support to Beam? :)
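To illustrate the first alternative, here is a hedged sketch with hypothetical shard names: ReadFromText accepts a file pattern, so once the single ~50 MB archive is re-split into several smaller gzip files, Beam can hand different files to different workers, even though each individual file is still decompressed serially.

import json

import apache_beam as beam

# Hypothetical layout: the original archive re-split into shards such as
# gs://example-bucket/raw/part-000.json.gz ... part-015.json.gz
with beam.Pipeline() as p:
    parsed = (
        p
        | 'READ SHARDS' >> beam.io.ReadFromText('gs://example-bucket/raw/part-*.json.gz')
        | 'JSON-ing' >> beam.Map(json.loads)
    )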
