How to use Google Cloud Storage in a Dataflow pipeline run from Datalab

Question

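We have been running a Python pipeline in Datalab that reads image files from a bucket in Google Cloud Storage (using "import google.datalab.storage"). Originally we used DirectRunner and this worked fine, but now we are trying to use DataflowRunner and are getting import errors. Even if we include "import google.datalab.storage" or any variant of it inside the function run by the pipeline, we get errors such as "No module named 'datalab.storage'". We have also tried the save_main_session, requirements_file, and setup_file flags, with no luck. How do we correctly access image files in Cloud Storage buckets from a Dataflow pipeline?

EDIT: My original error was caused by specifying the requirements_file flag with incorrect syntax (i.e. "--requirements_file ./requirements.txt"). I think I have fixed the syntax there, but now I am getting a different error. Here is a basic version of the code we are trying to run: a pipeline that reads files from a storage bucket in Google Cloud. We have a Datalab notebook with a cell containing the following Python code: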

import apache_beam as beam
from apache_beam.utils.pipeline_options import PipelineOptions
from apache_beam.utils.pipeline_options import GoogleCloudOptions
from apache_beam.utils.pipeline_options import StandardOptions
import google.datalab.storage as storage

bucket = "BUCKET_NAME"
shared_bucket = storage.Bucket(bucket)

# Create and set PipelineOptions. 
options = PipelineOptions(flags = ["--requirements_file", "./requirements.txt"])
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "PROJECT_NAME"
google_cloud_options.job_name = 'test-pipeline-requirements'
google_cloud_options.staging_location = 'gs://BUCKET_NAME/binaries'
google_cloud_options.temp_location = 'gs://BUCKET_NAME/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

def read_file(input_tuple):
  filepath = input_tuple[0]
  shared_object = shared_bucket.object(filepath)
  f = shared_object.read_stream()
  # More processing of f's contents
  return input_tuple

# File paths relative to the bucket
input_tuples = [("FILEPATH_1", "UNUSED_FILEPATH_2")]
p = beam.Pipeline(options = options)
all_files = (p | "Create file path tuple" >> beam.Create(input_tuples))
all_files = (all_files | "Read file" >> beam.FlatMap(read_file))
p.run()

Meanwhile, in the same directory as the notebook there is a file named "requirements.txt" containing only one line:

datalab==1.0.1

If I use DirectRunner, this code works fine. However, when I use DataflowRunner, I get a CalledProcessError at "p.run()", with a stack trace ending in the following:

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _populate_requirements_cache(requirements_file, cache_dir)
    224       '--no-binary', ':all:']
    225   logging.info('Executing command: %s', cmd_args)
--> 226   processes.check_call(cmd_args)
    227
    228

/usr/local/lib/python2.7/dist-packages/apache_beam/utils/processes.pyc in check_call(*args, **kwargs)
     38   if force_shell:
     39     kwargs['shell'] = True
---> 40   return subprocess.check_call(*args, **kwargs)
     41
     42

/usr/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
    538         if cmd is None:
    539             cmd = popenargs[0]
--> 540         raise CalledProcessError(retcode, cmd)
    541     return 0
    542

CalledProcessError: Command '['/usr/bin/python', '-m', 'pip', 'install', '--download', '/tmp/dataflow-requirements-cache', '-r', './requirements.txt', '--no-binary', ':all:']' returned non-zero exit status 1

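It seems the "--download" option is deprecated for pip, but that is part of the apache_beam code. I have also tried different ways of specifying "requirements.txt", with and without the "--save_main_session" flag, and with and without the "--setup_file" flag, but no dice.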

Answer 1

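The most likely issue is that you need to have Dataflow install the datalab PyPI module.

Typically you would do this by listing "datalab" in the requirements.txt file that you upload to Dataflow. See https://cloud.google.com/dataflow/pipelines/dependencies-python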
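As a rough sketch of the step above, the helper below writes a requirements file and returns the flag list in the form the question's edit found to work, with the flag and its value as separate list items. The name build_requirements_flags and its parameters are hypothetical and not part of apache_beam or Datalab; only the "--requirements_file" flag and the "datalab==1.0.1" pin come from this thread.

```python
import os


def build_requirements_flags(packages, path="requirements.txt"):
    """Write one package spec per line and return Beam-style flags.

    Hypothetical helper for illustration; not part of apache_beam.
    """
    with open(path, "w") as handle:
        for package in packages:
            handle.write(package + "\n")
    # The flag and its value are separate list items,
    # not a single "--requirements_file ./requirements.txt" string.
    return ["--requirements_file", os.path.abspath(path)]
```

The returned list could then be passed as PipelineOptions(flags=flags), as in the question's code. Whether the listed package actually installs on the workers still depends on pip being able to fetch and build it.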

Answer 2

If your only use of pydatalab is to read from GCS, then I would suggest using Dataflow's gcsio. Code example:

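def read_file(input_tuple):
  # Import inside the function so the name also resolves on the Dataflow workers.
  from apache_beam.io.gcp import gcsio
  filepath = input_tuple[0]
  with gcsio.GcsIO().open(filepath, 'r') as f:
    # process f content
    pass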

# File paths relative to the bucket
input_tuples = [("gs://bucket/file.jpg", "UNUSED_FILEPATH_2")]
p = beam.Pipeline(options = options)
all_files = (p | "Create file path tuple" >> beam.Create(input_tuples))
all_files = (all_files | "Read file" >> beam.FlatMap(read_file))
p.run()

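pydatalab is pretty heavy, since it is more of a data exploration library for use with Datalab or Jupyter. Dataflow's GcsIO, on the other hand, is supported natively in the pipeline.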
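One difference from the question's code is worth spelling out: the question passes paths relative to the bucket, while the example above passes full gs:// paths to GcsIO. A minimal sketch of converting between the two forms follows. The function names to_gcs_uri and split_gcs_uri are hypothetical helpers written for illustration and are not part of apache_beam.

```python
def to_gcs_uri(bucket, relative_path):
    """Join a bucket name and a bucket-relative path into a gs:// URI."""
    if relative_path.startswith("gs://"):
        return relative_path  # already a full URI, leave unchanged
    return "gs://{}/{}".format(bucket.strip("/"), relative_path.lstrip("/"))


def split_gcs_uri(uri):
    """Split gs://bucket/object into a (bucket, object_name) tuple."""
    if not uri.startswith("gs://"):
        raise ValueError("not a gs:// URI: %r" % uri)
    bucket, _, object_name = uri[len("gs://"):].partition("/")
    return bucket, object_name
```

With these, the question's input tuples could be rewritten before beam.Create, for example by mapping each first element through to_gcs_uri("BUCKET_NAME", path).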
