
How to handle operations on local files over multiple ParDo transforms in Apache Beam / Google Cloud DataFlow

I am developing an ETL pipeline for Google Cloud Dataflow with several branching ParDo transforms, each of which requires a local audio file. The branched results are then combined and exported as text.

This was initially a Python script that ran on a single machine; I am attempting to adapt it for parallelisation across VM workers using GC Dataflow.

The extraction process downloads the files from a single GCS bucket location, then deletes them after the transform is completed to keep storage under control. This is due to the pre-processing module, which requires local access to the files. It could be re-engineered to handle a byte stream instead of a file by rewriting some of the pre-processing libraries myself; however, some attempts at this aren't going well, and I'd first like to explore how to handle parallelised local file operations in Apache Beam / GC Dataflow in order to understand the framework better.

In this rough implementation each branch downloads and deletes the files, with lots of double handling. In my implementation I have 8 branches, so each file is downloaded and deleted 8 times. Could a GCS bucket instead be mounted on every worker rather than downloading files from the remote?

Or is there another way to ensure workers are passed the correct reference to a file, so that:

  • a single DownloadFilesDoFn() can download a batch,
  • then fan out the local file references in a PCollection to all the branches,
  • and a final CleanUpFilesDoFn() can remove them afterwards?
  • How can you parallelise local file references?

What is the best branched ParDo strategy for Apache Beam / GC Dataflow if local file operations cannot be avoided?


Some example code of my existing implementation, with two branches for simplicity:

# singleton decorator
def singleton(cls):
  instances = {}
  def getinstance(*args, **kwargs):
      # Cache one instance per constructor-argument tuple, so that
      # Predict(model1) and Predict(model2) stay distinct
      key = (cls, args)
      if key not in instances:
          instances[key] = cls(*args, **kwargs)
      return instances[key]
  return getinstance

@singleton
class Predict():
  def __init__(self, model):
    '''
    Process audio, reads in filename 
    Returns Prediction
    '''
    self.model = model

  def process(self, filename):
      #simplified pseudocode
      audio = preprocess.load(filename=filename)
      prediction = inference(self.model, audio)
      return prediction

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.localfile, self.model = "", model

  def process(self, element):
    # Construct Predict() object singleton per worker
    predict = Predict(self.model)

    # Copy the file from GCS to the worker's local disk
    # (cwd is assumed to be defined at module level, e.g. cwd = os.getcwd())
    subprocess.run(['gsutil', 'cp', element['GCSPath'], './'], cwd=cwd, shell=False)
    self.localfile = cwd + "/" + element['GCSPath'].split('/')[-1]

    res = predict.process(self.localfile)
    return [{
        'Index': element['Index'],
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]

  def finish_bundle(self):
    # NOTE: this only removes the last file processed in the bundle,
    # so earlier files leak onto the worker's disk
    subprocess.run(['rm', self.localfile], cwd=cwd, shell=False)


# DoFn to split each csv line into a dict
# (the GCS bucket could perhaps be read as a PCollection instead)
class Split(beam.DoFn):
    def process(self, element):
        Index, Title, GCSPath = element.split(",")
        GCSPath = 'gs://mybucket/' + GCSPath
        return [{
            'Index': int(Index),
            'Title': Title,
            'GCSPath': GCSPath
        }]
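The split above assumes no commas appear inside fields. As a hedged, standalone sketch, the same parsing can be done with the `csv` module so that quoted titles containing commas survive (the bucket name `mybucket` is taken from the code above; `split_line` is a hypothetical helper, not part of the pipeline):

```python
import csv
import io

def split_line(line, bucket='mybucket'):
    """Parse one 'Index,Title,GCSPath' CSV line into the dict shape
    used by the pipeline, tolerating quoted titles with commas."""
    index, title, path = next(csv.reader(io.StringIO(line)))
    return {
        'Index': int(index),
        'Title': title,
        'GCSPath': 'gs://' + bucket + '/' + path,
    }

# Example: a quoted title containing a comma is kept intact
row = split_line('3,"Song, The",audio/song.wav')
# → {'Index': 3, 'Title': 'Song, The', 'GCSPath': 'gs://mybucket/audio/song.wav'}
```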

A simplified version of the pipeline:

with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
    )
    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches into a single PCollection
    # (a keyed join, e.g. CoGroupByKey on 'Index', would be needed to
    # merge the per-file results into one record)
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)

In order to avoid downloading the files repeatedly, the contents of the files can be put into the PCollection.

class DownloadFilesDoFn(beam.DoFn):
  def __init__(self):
    import re
    self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

  def start_bundle(self):
    # Create the GCS client per bundle so it isn't pickled with the DoFn
    import google.cloud.storage
    self.gcs = google.cloud.storage.Client()

  def process(self, element):
    file_match = self.gcs_path_regex.match(element['GCSPath'])
    bucket = self.gcs.get_bucket(file_match.group(1))
    blob = bucket.get_blob(file_match.group(2))
    element['file_contents'] = blob.download_as_bytes()
    yield element
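If the pre-processing library truly requires a path on disk rather than bytes, the downloaded contents can still be staged through a temporary file inside each branch's DoFn, so the download itself happens only once. A minimal standalone sketch of the temp-file handling, assuming a path-based loader (the helper name and suffix are hypothetical):

```python
import os
import tempfile

def with_local_file(file_contents, suffix='.wav'):
    """Write bytes to a temporary file, hand its path to a
    path-based library, and always clean up afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(file_contents)
        # A call like preprocess.load(filename=path) would go here;
        # for illustration we just re-read the bytes from disk.
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)  # the temp file never outlives the call

result = with_local_file(b'RIFF....WAVE')
# → b'RIFF....WAVE'
```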
     

Then PredictDoFn becomes:

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.model = model

  def start_bundle(self):
    # One Predict() instance per worker (singleton)
    self.predict = Predict(self.model)

  def process(self, element):
    # Predict.process must now accept raw bytes rather than a filename
    res = self.predict.process(element['file_contents'])
    return [{
        'Index': element['Index'],
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]

and the pipeline:

with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
          | 'Read files' >> beam.ParDo(DownloadFilesDoFn())
    )
    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
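Flattening simply concatenates the branch outputs, one record per branch per file. If the goal is one combined record per Index, the grouping that a keyed join (e.g. Beam's CoGroupByKey) would perform can be sketched in plain Python; the dict shapes follow the code above, and the dict-update merge is an assumption about how conflicting keys should be handled:

```python
from collections import defaultdict

def merge_by_index(*branches):
    """Merge prediction dicts from several branches into one
    record per 'Index', mimicking a keyed join."""
    merged = defaultdict(dict)
    for branch in branches:
        for rec in branch:
            merged[rec['Index']].update(rec)
    return [merged[i] for i in sorted(merged)]

# Example with two branches producing one prediction each
preds1 = [{'Index': 1, 'Title': 'a', 'model1Prediction': 0.9}]
preds2 = [{'Index': 1, 'Title': 'a', 'model2Prediction': 0.2}]
rows = merge_by_index(preds1, preds2)
# → [{'Index': 1, 'Title': 'a', 'model1Prediction': 0.9, 'model2Prediction': 0.2}]
```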


