
How to handle operations on local files over multiple ParDo transforms in Apache Beam / Google Cloud DataFlow

I am developing an ETL pipeline for Google Cloud Dataflow with several branching ParDo transforms, each of which requires a local audio file. The branched results are then combined and exported as text.

This was initially a Python script that ran on a single machine; I am attempting to adapt it for parallelisation across VM workers using GC Dataflow.

The extraction process downloads the files from a single GCS bucket location, then deletes them after the transform is completed to keep storage under control. This is due to the pre-processing module, which requires local access to the files. It could be re-engineered to handle a byte stream instead of a file by rewriting some of the pre-processing libraries myself; however, some attempts at this aren't going well, and I'd first like to explore how to handle parallelised local file operations in Apache Beam / GC Dataflow in order to understand the framework better.

In this rough implementation each branch downloads and deletes the files, with lots of double handling. In my implementation I have 8 branches, so each file is downloaded and deleted 8 times. Could a GCS bucket instead be mounted on every worker rather than downloading files from the remote?

Or is there another way to ensure workers are passed the correct reference to a file, so that:

  • a single DownloadFilesDoFn() can download a batch,
  • then fan out the local file references in a PCollection to all the branches,
  • and a final CleanUpFilesDoFn() can remove them afterwards?
  • How can you parallelise local file references?

What is the best branched ParDo strategy for Apache Beam / GC Dataflow if local file operations cannot be avoided?


Some example code of my existing implementation, with two branches for simplicity:

# singleton decorator
def singleton(cls):
  instances = {}
  def getinstance(*args, **kwargs):
      # Cache one instance per constructor-argument tuple, so that
      # Predict(model1) and Predict(model2) stay distinct
      key = (cls, args)
      if key not in instances:
          instances[key] = cls(*args, **kwargs)
      return instances[key]
  return getinstance

@singleton
class Predict():
  def __init__(self, model):
    '''
    Process audio, reads in filename 
    Returns Prediction
    '''
    self.model = model

  def process(self, filename):
      #simplified pseudocode
      audio = preprocess.load(filename=filename)
      prediction = inference(self.model, audio)
      return prediction

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.localfile, self.model = "", model

  def process(self, element):
    # Construct Predict() object singleton per worker
    predict = Predict(self.model)

    # Copy the file from GCS to the worker's local disk
    # (cwd is assumed to be defined at module level, e.g. cwd = os.getcwd())
    subprocess.run(['gsutil', 'cp', element['GCSPath'], './'], cwd=cwd, shell=False)
    self.localfile = cwd + "/" + element['GCSPath'].split('/')[-1]

    res = predict.process(self.localfile)
    return [{
        'Index': element['Index'],
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]

  def finish_bundle(self):
    # NOTE: this only removes the last file processed in the bundle,
    # so earlier files leak onto the worker's disk
    subprocess.run(['rm', self.localfile], cwd=cwd, shell=False)


# DoFn to split each csv line into a dict
# (the GCS bucket could perhaps be read as a PCollection instead)
class Split(beam.DoFn):
    def process(self, element):
        Index, Title, GCSPath = element.split(",")
        GCSPath = 'gs://mybucket/' + GCSPath
        return [{
            'Index': int(Index),
            'Title': Title,
            'GCSPath': GCSPath
        }]
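The split above assumes no commas appear inside fields. As a hedged, standalone sketch, the same parsing can be done with the `csv` module so that quoted titles containing commas survive (the bucket name `mybucket` is taken from the code above; `split_line` is a hypothetical helper, not part of the pipeline):

```python
import csv
import io

def split_line(line, bucket='mybucket'):
    """Parse one 'Index,Title,GCSPath' CSV line into the dict shape
    used by the pipeline, tolerating quoted titles with commas."""
    index, title, path = next(csv.reader(io.StringIO(line)))
    return {
        'Index': int(index),
        'Title': title,
        'GCSPath': 'gs://' + bucket + '/' + path,
    }

# Example: a quoted title containing a comma is kept intact
row = split_line('3,"Song, The",audio/song.wav')
# → {'Index': 3, 'Title': 'Song, The', 'GCSPath': 'gs://mybucket/audio/song.wav'}
```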

A simplified version of the pipeline:

with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
    )
    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches into a single PCollection
    # (a keyed join, e.g. CoGroupByKey on 'Index', would be needed to
    # merge the per-file results into one record)
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)

In order to avoid downloading the files repeatedly, the contents of the files can be put into the PCollection.

class DownloadFilesDoFn(beam.DoFn):
  def __init__(self):
    import re
    self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

  def start_bundle(self):
    # Create the GCS client per bundle so it isn't pickled with the DoFn
    import google.cloud.storage
    self.gcs = google.cloud.storage.Client()

  def process(self, element):
    file_match = self.gcs_path_regex.match(element['GCSPath'])
    bucket = self.gcs.get_bucket(file_match.group(1))
    blob = bucket.get_blob(file_match.group(2))
    element['file_contents'] = blob.download_as_bytes()
    yield element
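If the pre-processing library truly requires a path on disk rather than bytes, the downloaded contents can still be staged through a temporary file inside each branch's DoFn, so the download itself happens only once. A minimal standalone sketch of the temp-file handling, assuming a path-based loader (the helper name and suffix are hypothetical):

```python
import os
import tempfile

def with_local_file(file_contents, suffix='.wav'):
    """Write bytes to a temporary file, hand its path to a
    path-based library, and always clean up afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(file_contents)
        # A call like preprocess.load(filename=path) would go here;
        # for illustration we just re-read the bytes from disk.
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)  # the temp file never outlives the call

result = with_local_file(b'RIFF....WAVE')
# → b'RIFF....WAVE'
```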
     

Then PredictDoFn becomes:

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.model = model

  def start_bundle(self):
    # One Predict() instance per worker (singleton)
    self.predict = Predict(self.model)

  def process(self, element):
    # Predict.process must now accept raw bytes rather than a filename
    res = self.predict.process(element['file_contents'])
    return [{
        'Index': element['Index'],
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]

and the pipeline:

with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
          | 'Read files' >> beam.ParDo(DownloadFilesDoFn())
    )
    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches into a single PCollection
    joined = (preds1, preds2) | 'Join Branches' >> beam.Flatten()

    # output to file
    output = joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
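Flattening simply concatenates the branch outputs, one record per branch per file. If the goal is one combined record per Index, the grouping that a keyed join (e.g. Beam's CoGroupByKey) would perform can be sketched in plain Python; the dict shapes follow the code above, and the dict-update merge is an assumption about how conflicting keys should be handled:

```python
from collections import defaultdict

def merge_by_index(*branches):
    """Merge prediction dicts from several branches into one
    record per 'Index', mimicking a keyed join."""
    merged = defaultdict(dict)
    for branch in branches:
        for rec in branch:
            merged[rec['Index']].update(rec)
    return [merged[i] for i in sorted(merged)]

# Example with two branches producing one prediction each
preds1 = [{'Index': 1, 'Title': 'a', 'model1Prediction': 0.9}]
preds2 = [{'Index': 1, 'Title': 'a', 'model2Prediction': 0.2}]
rows = merge_by_index(preds1, preds2)
# → [{'Index': 1, 'Title': 'a', 'model1Prediction': 0.9, 'model2Prediction': 0.2}]
```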


