
Incredibly slow read from GCS with GCP Dataflow & Apache Beam Python SDK

Let me preface everything by first stating I am just getting to know my way around working with Beam's Python SDK and GCP Dataflow!

The issue: My pipeline has been working great for almost all use cases. No errors that I can complain about. I just have some questions about some possible bottlenecks or optimizations I could make. I have noticed that my pipeline execution times jump to over 3 hours when working with a gzipped file of size ~50 MB. Not entirely sure if there is any way to speed up that part. Below is a screenshot of the log warnings I see a bunch of before the job eventually successfully completes.

Log output details from Dataflow

Here is a relevant snippet of the pipeline:

import argparse
import json

import apache_beam as beam

if __name__ == '__main__':

    parser = argparse.ArgumentParser(
        description='Beam Pipeline: example'
    )
    parser.add_argument('--input-glob',
                        help='Cloud Path To Raw Data',
                        required=True)
    parser.add_argument('--output-bq-table',
                        help='BQ Native {dataset}.{table}')
    known_args, pipeline_args = parser.parse_known_args()

    with beam.Pipeline(argv=pipeline_args) as p:

        # Read the (optionally gzipped) input and parse each line as JSON
        json_raw = (p | 'READING RAW DATA' >> beam.io.ReadFromText(known_args.input_glob)
            | 'JSON-ing' >> beam.Map(lambda e: json.loads(e))
        )

Extra info:

  1. I am using Airflow's DataflowHook to launch this.
  2. I've played around with different machine types, hoping that throwing more compute power at the problem would make it go away, but so far no luck.
  3. I am using the following pipeline execution params:

--runner=DataflowRunner \
--project=example-project \
--region=us-east4 \
--save_main_session \
--temp_location=gs://example-bucket/temp/ \
--input-glob=gs://example-bucket/raw/  \
--machine_type=n1-standard-16 \
--job_name=this-is-a-test-insert

Beam has numerous optimizations that allow it to split its processing of files across multiple workers. Furthermore, it is able to split a single file so that it can be consumed by multiple workers in parallel.

Unfortunately, this is not possible with gzip files. This is because gzip files are compressed into a single block that has to be decompressed serially. When the Beam worker reads this file, it has to read the whole thing serially.

There are some compression formats that allow you to read them in parallel (these are usually 'multiblock' formats). Unfortunately, I believe the Beam Python SDK only supports serial formats at the moment.

If you need to have your pipeline work this way, you could try adding a beam.Reshuffle operation after ReadFromText. By doing this, your pipeline will still read the file serially, but apply all downstream operations in parallel (see the Performance section in the PTransform guide to see why this is the case).
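As a rough illustration, here is a minimal sketch of that change, assuming the same --input-glob argument as in the question (the argument parsing is abbreviated):

import argparse
import json

import apache_beam as beam

parser = argparse.ArgumentParser()
parser.add_argument('--input-glob', required=True)
known_args, pipeline_args = parser.parse_known_args()

with beam.Pipeline(argv=pipeline_args) as p:
    json_raw = (
        p
        | 'READING RAW DATA' >> beam.io.ReadFromText(known_args.input_glob)
        # Reshuffle breaks fusion with the serial read, so parsing and any
        # later steps can be redistributed across all workers.
        | 'RESHUFFLE' >> beam.Reshuffle()
        | 'JSON-ing' >> beam.Map(json.loads)
    )

The read itself is still single-threaded, so this mainly helps when the downstream transforms dominate the running time.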

Some other alternatives:

  • Separate your data into multiple gzip files (a sketch of this follows the list below).
  • Decompress it before consuming it in the pipeline.
  • (half joking/half serious) Contribute multiblock compression support to Beam? :)
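To illustrate the first alternative, here is a hedged sketch with hypothetical shard names: ReadFromText accepts a file pattern, so once the single ~50 MB archive is re-split into several smaller gzip files, Beam can hand different files to different workers, even though each individual file is still decompressed serially.

import json

import apache_beam as beam

# Hypothetical layout: the original archive re-split into shards such as
# gs://example-bucket/raw/part-000.json.gz ... part-015.json.gz
with beam.Pipeline() as p:
    parsed = (
        p
        | 'READ SHARDS' >> beam.io.ReadFromText('gs://example-bucket/raw/part-*.json.gz')
        | 'JSON-ing' >> beam.Map(json.loads)
    )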
