
How to use the matplotlib module with the Apache Beam Google Dataflow runner

Is it possible to get the matplotlib module from within Google Dataflow (Beam)? I have it in my requirements.txt:

matplotlib==2.0.2

But still get the error:

ImportError: No module named matplotlib

Thanks!

To prepare custom Dataflow workers, you should provide a setup.py file with commands that install the required packages. First, create a setup.py file like the one below (it's a generic setup.py file). Either list your packages in the REQUIRED_PACKAGES variable, or put pip install matplotlib==2.0.2 in CUSTOM_COMMANDS as I did.

Please note that matplotlib needs some additional packages/libraries installed in the system, so you need to install them too, by specifying install commands for them, as sketched below. Moreover, if you want to render plots inside a Dataflow job, you will need to configure the matplotlib backend to one that can write output to a file (see How can I set the 'backend' in matplotlib in Python?).
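For example, a minimal sketch of what this could look like (the apt package names below are assumptions and depend on the worker image, so verify them for your setup):

# In setup.py: install system libraries first, then matplotlib itself.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    # Assumed names of matplotlib's native dependencies; adjust as needed.
    ['apt-get', '--assume-yes', 'install', 'libfreetype6-dev', 'libpng-dev'],
    ['pip', 'install', 'matplotlib==2.0.2'],
]

# In your pipeline code: select the file-based Agg backend before importing
# pyplot, so plots can be rendered on display-less workers.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt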

Then, after creating the setup.py file, just specify the corresponding Apache Beam pipeline parameter:

import apache_beam as beam

p = beam.Pipeline("DataflowRunner", argv=[
    '--setup_file', './setup.py',
    # put other parameters here
])
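For reference, a Dataflow pipeline usually also needs a project and GCS staging/temp locations; a minimal sketch with placeholder values:

import apache_beam as beam

p = beam.Pipeline("DataflowRunner", argv=[
    '--project', 'my-gcp-project',                   # placeholder project ID
    '--staging_location', 'gs://my-bucket/staging',  # placeholder bucket
    '--temp_location', 'gs://my-bucket/temp',
    '--job_name', 'matplotlib-job',
    '--setup_file', './setup.py',
])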

Generic setup.py file:

import logging
import subprocess

import setuptools
from setuptools.command.install import install as _install


class install(_install):  # pylint: disable=invalid-name
    """Runs the custom commands before the standard install step."""

    def run(self):
        self.run_command('CustomCommands')
        _install.run(self)

# Each entry is a command list executed on the worker at install time.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'matplotlib==2.0.2'],
]


class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def RunCustomCommand(self, command_list):
        logging.info('Running command: %s' % command_list)
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # Can use communicate(input='y\n'.encode()) if the command run requires
        # some confirmation.
        stdout_data, _ = p.communicate()
        logging.info('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code: %s' % (command_list, p.returncode))

    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)


REQUIRED_PACKAGES = [
    # e.g. 'matplotlib==2.0.2'
]


setuptools.setup(
    name='name',
    version='1.0.0',
    description='DataFlow worker',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'install': install,
        'CustomCommands': CustomCommands,
        }
    )
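To sanity-check the setup, here is a minimal pipeline sketch (names and paths are placeholders). matplotlib is imported inside the function so the import happens on the worker, and the plot is written to a file local to that worker; in a real job you would upload it to GCS instead:

import apache_beam as beam

def render_plot(value):
    # Import on the worker, selecting the file-based Agg backend
    # before pyplot is imported.
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt

    fig = plt.figure()
    plt.plot([0, value])
    fig.savefig('/tmp/plot-%s.png' % value)  # local to the worker
    plt.close(fig)
    return value

p = beam.Pipeline("DataflowRunner", argv=['--setup_file', './setup.py'])
p | beam.Create([1, 2, 3]) | beam.Map(render_plot)
p.run()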
