TensorFlow Data Validation (TFDV) fails on Google Cloud Dataflow with "Can't get attribute 'NumExamplesStatsGenerator'"

I'm following this "Get started" TensorFlow tutorial on how to run TFDV with Apache Beam on Google Cloud Dataflow. My code is very similar to the code in the tutorial:

import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions

PROJECT_ID = 'my-project-id'
JOB_NAME = 'my-job-name'
REGION = "europe-west3"
NETWORK = "regions/europe-west3/subnetworks/mysubnet"
GCS_STAGING_LOCATION = 'gs://my-bucket/staging'
GCS_TMP_LOCATION = 'gs://my-bucket/tmp'
GCS_DATA_LOCATION = 'gs://another-bucket/my-data.CSV'
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = 'gs://my-bucket/stats'

# downloaded locally with: pip download tensorflow_data_validation --no-deps --platform manylinux2010_x86_64 --only-binary=:all:
# (It would be great to keep the wheel on Cloud Storage instead:)
# PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
PATH_TO_WHL_FILE = '/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'


# Create and set your PipelineOptions.
options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
google_cloud_options.region = REGION
options.view_as(StandardOptions).runner = 'DataflowRunner'

setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 2

print("Generating stats...")
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION, output_path=GCS_STATS_OUTPUT_PATH, pipeline_options=options)
print("Stats generated!")

The code above starts a Dataflow job, but unfortunately it fails with the following error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute 'NumExamplesStatsGenerator' on <module 'tensorflow_data_validation.statistics.stats_impl' from '/usr/local/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py'>

I couldn't find anything similar on the internet. If it helps, I have the following versions on my local machine (macOS):

Apache Beam version: 2.34.0
Tensorflow version: 2.6.2
TensorFlow Transform version: 1.4.0
TFDV version: 1.4.0

On the cloud, the job runs with Apache Beam Python 3.8 SDK 2.34.0.

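Bonus question: my other question is about PATH_TO_WHL_FILE. I tried putting the wheel in a storage bucket, but Beam doesn't seem to be able to pick it up from there. It only works with a local path, which is a real problem because it makes this code harder to distribute. What is the best practice for distributing this wheel file?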

Judging by the name of the attribute, NumExamplesStatsGenerator looks like a generator that cannot be pickled.

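However, I can't find that attribute in the module anymore. A search shows that the module did contain it in 1.4.0, so you may want to try a newer version of TFDV.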
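The question lists TFDV 1.4.0 on the local machine but stages a 1.7.0 wheel for the workers, and the traceback shows the failure happening while the workers unpickle the pipeline. A version mismatch between the two sides is therefore a plausible cause. The following is a minimal sketch of a pre-flight check that compares the two versions before the job is submitted. The helper names `wheel_version` and `check_versions_match` are hypothetical and are not part of TFDV or Beam.

```python
import re
from typing import Optional

# Wheel filenames have the form
# "{distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl".
_WHEEL_NAME = re.compile(r"^(?P<name>[^-]+)-(?P<version>[^-]+)-.+\.whl$")


def wheel_version(wheel_path: str) -> str:
    """Return the version encoded in a wheel filename (hypothetical helper)."""
    # Works for local paths and gs:// URIs alike: keep only the last segment.
    filename = wheel_path.rsplit("/", 1)[-1]
    match = _WHEEL_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a wheel filename: {filename!r}")
    return match.group("version")


def check_versions_match(local_version: str, wheel_path: str) -> Optional[str]:
    """Return a warning if the staged wheel differs from the local install.

    Returns None when both versions are identical (hypothetical helper).
    """
    staged = wheel_version(wheel_path)
    if staged != local_version:
        return (
            f"local TFDV is {local_version} but the staged wheel is {staged}; "
            "objects pickled locally may not unpickle on the workers"
        )
    return None


if __name__ == "__main__":
    # Values taken from the question; in a real script the local version
    # could be read from the installed package instead of being hard-coded.
    wheel = (
        "/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-"
        "manylinux_2_12_x86_64.manylinux2010_x86_64.whl"
    )
    print(check_versions_match("1.4.0", wheel))
```

If the check reports a mismatch, either install the same TFDV version locally as the wheel you stage, or download the wheel that matches the local install.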

PATH_TO_WHL_FILE points to a file that is staged and distributed to Dataflow for execution, so you can use a file on GCS.
