
How to correctly package an Apache Beam project to run on Google Dataflow

I'm struggling a bit to find the best project/code structure to run my Python-based Apache Beam project on Google Dataflow. With my current setup I can deploy everything, but as soon as my pipeline receives data through Google Pub/Sub it raises exceptions like this:

... some more lines ...
File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1582, in _create_pardo_operation
    dofn_data = pickler.loads(serialized_fn)
File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 294, in loads
    return dill.loads(s)
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)

ModuleNotFoundError: No module named 'coruscantbeam.coruscantbeam' passed through: ==> dist_proc/dax/workflow/worker/fnapi_service.cc:631

This is my current project structure:

\coruscantbeam
  __init__.py
  setup.py
  README.md
  \coruscantbeam
    __init__.py
    dataflow.py
    \pardos
      __init__.py
      mypardo.py

To deploy the project to Google Dataflow I call dataflow.py directly. The file dataflow.py looks like this:

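import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

from .pardos.mypardo import MyPardo

# Placeholder values; the real ones are not shown in this post.
streaming = True
topic = "projects/<project>/topics/<topic>"

pipeline_options = PipelineOptions(
    streaming=streaming,
    save_main_session=True,
    setup_file=os.path.join(os.path.dirname(__file__), "..", "setup.py"),
)

def run_beam():
  # Leaving the "with" block runs the pipeline.
  with beam.Pipeline(options=pipeline_options) as p:
    p | beam.io.ReadFromPubSub(topic=topic) | beam.ParDo(MyPardo())

if __name__ == "__main__":
  run_beam()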

This is simplified pseudocode, but the original code works locally and I can deploy it to the cloud. As mentioned, though, it does not process data because of the module import errors.

I've already played around with various structures, and this is the one that brought me closest to something running-ish ;)

I finally managed to get it packaged correctly. This is how it looks now:

\coruscantbeam
  setup.py
  requirements.txt
  README.md
  \coruscantbeam
    __init__.py
    __main__.py
    main.py
    \pardos
      __init__.py
      mypardo.py

and main.py looks like this:

import os

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

from .pardos.mypardo import MyPardo

# Placeholder values; the real ones are not shown in this post.
streaming = True
topic = "projects/<project>/topics/<topic>"

pipeline_options = PipelineOptions(
    streaming=streaming,
    save_main_session=True,
    setup_file=os.path.join(os.path.dirname(__file__), "..", "setup.py"),
    requirements_file=os.path.join(os.path.dirname(__file__), "..", "requirements.txt"),
)

def run():
  p = beam.Pipeline(options=pipeline_options)
  p | beam.io.ReadFromPubSub(topic=topic) | beam.ParDo(MyPardo())

  pipeline_result = p.run()

  # Used while testing locally
  if pipeline_options.view_as(StandardOptions).runner == "DirectRunner":
    pipeline_result.wait_until_finish()
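
A side note on the setup_file and requirements_file arguments: both paths are built relative to main.py and point one directory up, at the project root where setup.py lives. Here is a minimal sketch of how that path resolves. The /work prefix is a made-up location for illustration only, and posixpath is used instead of os.path just to keep the output the same on every platform:

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical absolute location of main.py; only the last two levels matter.
main_file = "/work/coruscantbeam/coruscantbeam/main.py"

# Same construction as in main.py: start at the package directory, go one level up.
joined = posixpath.join(posixpath.dirname(main_file), "..", "setup.py")

# normpath collapses the ".." so the target is easier to read.
normalized = posixpath.normpath(joined)

# Equivalent spelling with pathlib.
via_pathlib = str(PurePosixPath(main_file).parent.parent / "setup.py")

print(joined)
print(normalized)
```

Both spellings end up at the setup.py that sits next to the inner coruscantbeam package directory.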

where the __main__.py looks like this:

import logging

if __name__ == "__main__":
    # Relative import works because the package is started with "python -m".
    from .main import run
    logging.getLogger().setLevel(level=logging.INFO)
    run()

Finally, the setup.py:

import setuptools

setuptools.setup(
    name='coruscantbeam',
    version='1.0',
    author="Indiana Jones",
    author_email="youremail@domain.tld",
    url="https://blabla.com",
    data_files = [('', ['coruscantbeam/some_schema.json'])],
    include_package_data=True,
    packages=setuptools.find_packages(),
)

and everything gets started by invoking: python -m coruscantbeam
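
To illustrate what python -m does with a package that has a __main__.py, here is a small sketch using only the standard library. It builds a throwaway package in a temporary directory (demo_pkg is a made-up name, not part of the project) and runs it through runpy, which is the machinery behind the -m switch. The point is that the relative import in __main__.py works and that the code in main.py is known under a single package prefix:

```python
import runpy
import sys
import tempfile
from pathlib import Path


def build_demo_package(root):
    """Create demo_pkg with __init__.py, main.py and __main__.py under root."""
    pkg = root / "demo_pkg"  # hypothetical package name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    # run() reports the module name it was imported under.
    (pkg / "main.py").write_text("def run():\n    return __name__\n")
    (pkg / "__main__.py").write_text("from .main import run\nRESULT = run()\n")


def run_like_dash_m(root):
    """Run demo_pkg roughly the way 'python -m demo_pkg' would."""
    sys.path.insert(0, str(root))
    try:
        return runpy.run_module("demo_pkg", run_name="__main__")
    finally:
        # Clean up so the demo leaves no trace in the interpreter.
        sys.path.remove(str(root))
        for name in [n for n in sys.modules if n.split(".")[0] == "demo_pkg"]:
            del sys.modules[name]


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    namespace = run_like_dash_m(Path(tmp))

module_name_seen_by_run = namespace["RESULT"]
print(module_name_seen_by_run)
```

Here run() sees itself as demo_pkg.main, so anything defined next to it would be referenced under that one-level package name.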

I'm quite happy with the result and it looks clean.
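
One plausible reading of the original ModuleNotFoundError (an interpretation, not something the traceback proves): pickle-style serializers record a class from an imported module by its dotted module path, and whoever loads the data must be able to import exactly that path. With an __init__.py next to setup.py, the code can end up imported as coruscantbeam.coruscantbeam..., while the package that setup.py installs on the worker is only coruscantbeam. The sketch below shows the mechanism with the standard pickle module instead of Beam's pickler, and with made-up module names (outer_pkg, inner_pkg):

```python
import pickle
import sys
import types

# Made-up dotted path that mimics a doubled package prefix.
MODULE_NAME = "outer_pkg.inner_pkg.pardos"


class MyPardo:
    """Stand-in for a DoFn; no Beam import needed for this demo."""


def register_fake_modules(dotted_name):
    """Put an empty module into sys.modules for every prefix of dotted_name."""
    parts = dotted_name.split(".")
    names = [".".join(parts[: i + 1]) for i in range(len(parts))]
    for name in names:
        sys.modules[name] = types.ModuleType(name)
    return names


def pickle_then_load_without_module():
    """Pickle an instance, drop the modules, and try to load it again."""
    names = register_fake_modules(MODULE_NAME)
    # Pretend the class was defined in the nested module.
    MyPardo.__module__ = MODULE_NAME
    sys.modules[MODULE_NAME].MyPardo = MyPardo
    payload = pickle.dumps(MyPardo())

    # Simulate the worker, where that dotted path is not importable.
    for name in names:
        del sys.modules[name]
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return payload, exc
    return payload, None


payload, error = pickle_then_load_without_module()
print(MODULE_NAME.encode() in payload)
print(type(error).__name__)
```

The pickled bytes carry the full dotted path, and loading fails as soon as that path cannot be imported, which matches the shape of the error in the question.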

