
How do I import numpy into an Apache Beam pipeline, running on GCP Dataflow?

I am attempting to write an Apache Beam pipeline using Python (3.7). I am running into issues importing numpy, specifically when attempting to use numpy in a DoFn transform class I wrote.

When running on GCP Dataflow, I get the following error: NameError: name 'numpy' is not defined

To start, everything works as one would expect when using the DirectRunner. The issue occurs only when using the Dataflow runner on GCP.

I believe the problem is related to how scoping works on GCP Dataflow, not the import itself. For example, I can successfully get the import to work if I add it inside the "process" method of my class, but not when the import is at the top of the file.
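For illustration, here is a minimal sketch of that workaround, with the import moved inside the process method (the body is simplified and the return value is a placeholder):

import apache_beam as beam

class Preprocess(beam.DoFn):
    def process(self, element, *args, **kwargs):
        import numpy  # imported at call time on the worker, so the name is defined there
        if numpy.isnan(numpy.sum(element['signal'])):
            return [element]  # placeholder; the real code wraps the signal in a custom object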

I tried using both a requirements file and a setup.py file as command options for the pipeline, but nothing changed (a sketch of how these options can be passed follows the setup.py file below). Again, I don't believe the problem is installing numpy, but rather that Dataflow handles the scoping of classes/functions in an unexpected way.

setup.py file

from __future__ import absolute_import
from __future__ import print_function
import setuptools

REQUIRED_PACKAGES = [
    'numpy',
    'Cython',
    'scipy',
    'google-cloud-bigtable'
]

setuptools.setup(
    name='my-pipeline',
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
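For reference, here is a minimal sketch of how such a setup.py can be handed to the Dataflow runner programmatically via Beam's SetupOptions; the project id and bucket below are placeholders, not values from this post:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',            # placeholder project id
    temp_location='gs://my-bucket/tmp',  # placeholder temp/staging bucket
    streaming=True,
)
# Point the runner at the setup.py above so workers install REQUIRED_PACKAGES.
options.view_as(SetupOptions).setup_file = './setup.py'

Passing --setup_file=./setup.py on the command line is equivalent.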

Overall, I am running into many issues with "scope" that I am hoping someone can help with, as the Apache Beam documentation really doesn't cover this too well.

from __future__ import absolute_import
from __future__ import division

import argparse
import json
import logging

import apache_beam as beam
import numpy
from apache_beam.options.pipeline_options import PipelineOptions

class Preprocess(beam.DoFn):

    def process(self, element, *args, **kwargs):
        # Demonstrating how I want to call numpy in the process function
        if numpy.isnan(numpy.sum(element['signal'])):
            return [MyObject(element['signal'])]  # MyObject is a user-defined class (not shown here)

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_subscription', required=True)
    args, pipeline_args = parser.parse_known_args(argv)
    options = PipelineOptions(pipeline_args)

    p = beam.Pipeline(options=options)
    messages = (p | beam.io.ReadFromPubSub(subscription=args.input_subscription).with_output_types(bytes))
    lines = messages | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    json_messages = lines | "Jsonify" >> beam.Map(lambda x: json.loads(x))

    preprocess_messages = json_messages | "Preprocess" >> beam.ParDo(Preprocess())
    result = p.run()
    result.wait_until_finish()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

I expect the pipeline to work similarly to how it does when running locally with the DirectRunner, but instead the scoping/importing works differently and causes my pipeline to crash.

When you launch an Apache Beam DirectRunner Python program from your desktop, the program runs on your desktop, where you have already installed the numpy library locally. However, you have not told Dataflow to download and install numpy on its workers. That is why your program runs with the DirectRunner but fails with the DataflowRunner.

Edit/create a normal Python requirements.txt file and include all dependencies, such as numpy. I prefer to use virtualenv, install the required packages, make sure that my program runs under the DirectRunner, and then run pip freeze to create my package list for requirements.txt. Now Dataflow will know what packages to install so that your program runs on the Dataflow cluster.
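As a concrete sketch of that workflow (file name assumed, not taken from the answer), the frozen dependency list can then be handed to the runner through Beam's SetupOptions, or equivalently with the --requirements_file command-line flag:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# requirements.txt was produced beforehand with: pip freeze > requirements.txt
options = PipelineOptions()
options.view_as(SetupOptions).requirements_file = 'requirements.txt'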
