Google Cloud Dataflow can't import 'google.cloud.datastore'

Question

This is my import code

from __future__ import absolute_import

import datetime
import json
import logging
import re

import apache_beam as beam
from apache_beam import combiners
from apache_beam.io.gcp.bigquery import parse_table_schema_from_json
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.pvalue import AsDict
from apache_beam.pvalue import AsSingleton
from apache_beam.options.pipeline_options import PipelineOptions

from google.cloud.proto.datastore.v1 import query_pb2
from google.cloud import datastore
from googledatastore import helper as datastore_helper, PropertyFilter

# datastore entities that we need to perform the mapping computations
#from models import UserPlan, UploadIntervalCount, RollingMonthlyCount

This is what my requirements.txt file looks like

$ cat requirements.txt
Flask==0.12.2
apache-beam[gcp]==2.1.1
gunicorn==19.7.1
google-cloud-dataflow==2.1.1
six==1.10.0
google-cloud-datastore==1.3.0
google-cloud

All of these are installed in the /lib directory, which contains the following

$ ls -1 lib/google/cloud
__init__.py
_helpers.py
_helpers.pyc
_http.py
_http.pyc
_testing.py
_testing.pyc
bigquery
bigtable
client.py
client.pyc
datastore
dns
environment_vars.py
environment_vars.pyc
error_reporting
exceptions.py
exceptions.pyc
gapic
iam.py
iam.pyc
language
language_v1
language_v1beta2
logging
monitoring
obselete.py
obselete.pyc
operation.py
operation.pyc
proto
pubsub
resource_manager
runtimeconfig
spanner
speech
speech_v1
storage
translate.py
translate.pyc
translate_v2
videointelligence.py
videointelligence.pyc
videointelligence_v1beta1
vision
vision_v1

Notice that both google.cloud.datastore and google.cloud.proto exist in the /lib folder. However, this import line works fine

from google.cloud.proto.datastore.v1 import query_pb2

but this one failed

from google.cloud import datastore

This is the exception (taken from the Google Cloud Dataflow console)

(9b49615f4d91c1fb): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 166, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10607)
    def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10501)
    with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 300, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:9702)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1133, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 767, in _import_module
    return getattr(__import__(module, None, None, [obj]), obj)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_pipeline/counters_pipeline.py", line 25, in <module>
    from google.cloud import datastore
ImportError: No module named datastore

Why can't it find the package?

Answer 1

External dependencies must be declared in a setup.py file, and that file must be passed to the pipeline with the --setup_file option. In setup.py you can either install your package by using a custom command

pip install google-cloud-datastore==1.3.0

or by adding your package to REQUIRED_PACKAGES:

REQUIRED_PACKAGES = ["google-cloud-datastore==1.3.0"]
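
A minimal setup.py along these lines might look like the sketch below. The package name and version are placeholders (the name is borrowed from the path in the traceback and may not match the real project layout), and the only dependency listed is the one from this answer. The setup() call is guarded so that the file does nothing unless it is run with a command such as sdist.

```python
# setup.py: a sketch under stated assumptions, not the asker's actual file.
import sys

# Packages every Dataflow worker must install before running the pipeline.
REQUIRED_PACKAGES = ["google-cloud-datastore==1.3.0"]


def build_setup_kwargs():
    """Collect the keyword arguments for setuptools.setup()."""
    return {
        "name": "dataflow_pipeline",  # placeholder name (assumption)
        "version": "0.0.1",           # placeholder version (assumption)
        "install_requires": REQUIRED_PACKAGES,
    }


# Only invoke setuptools when a command (e.g. "sdist") was passed in.
if __name__ == "__main__" and len(sys.argv) > 1:
    import setuptools

    kwargs = build_setup_kwargs()
    # Ship the pipeline's own modules to the workers as well.
    kwargs["packages"] = setuptools.find_packages()
    setuptools.setup(**kwargs)
```

Keeping the dependency list in REQUIRED_PACKAGES means the version pin lives in one place and can be kept in sync with requirements.txt.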

You need to specify it in setup.py because the libraries you load through appengine_config are not used during Dataflow execution. App Engine only acts as a scheduler here: it submits the job to the Dataflow service. Dataflow then creates worker machines that execute your pipeline, and those workers are not connected to App Engine in any way. Every package your pipeline needs must be installed on the Dataflow workers themselves, which is why you list the required packages in setup.py. The workers use this file to set themselves up.
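
As a rough illustration of how the setup file reaches the workers, the sketch below builds the flags that would be handed to the pipeline at launch. The helper name build_dataflow_args, the project ID, and the bucket name are all hypothetical; only the flag names themselves come from Beam's pipeline options.

```python
def build_dataflow_args(project, bucket, setup_file="./setup.py"):
    """Return the launch flags for running the pipeline on Dataflow."""
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--temp_location=gs://{}/tmp".format(bucket),
        "--staging_location=gs://{}/staging".format(bucket),
        # Tells Dataflow to build this package and install it on each worker.
        "--setup_file={}".format(setup_file),
    ]


if __name__ == "__main__":
    # "my-project" and "my-bucket" are placeholders (assumptions).
    args = build_dataflow_args("my-project", "my-bucket")
    print(" ".join(args))
    # With apache_beam installed, the list would be passed on as:
    #   options = PipelineOptions(args)
    #   pipeline = beam.Pipeline(options=options)
```

The same flags can be given directly on the command line when the pipeline script is started by hand.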

solution1 2 ACCEPTED 2017-10-22 09:01:21