
TensorFlow Data Validation (TFDV) fails on Google Cloud Dataflow with "Can't get attribute 'NumExamplesStatsGenerator'"

I am following this "get started" TensorFlow tutorial on how to run TFDV with Apache Beam on Google Cloud Dataflow. My code is very similar to the one in the tutorial:

import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions

PROJECT_ID = 'my-project-id'
JOB_NAME = 'my-job-name'
REGION = "europe-west3"
NETWORK = "regions/europe-west3/subnetworks/mysubnet"
GCS_STAGING_LOCATION = 'gs://my-bucket/staging'
GCS_TMP_LOCATION = 'gs://my-bucket/tmp'
GCS_DATA_LOCATION = 'gs://another-bucket/my-data.CSV'
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = 'gs://my-bucket/stats'

# Wheel downloaded locally with:
#   pip download tensorflow_data_validation --no-deps --platform manylinux2010_x86_64 --only-binary=:all:
# Ideally the wheel would live on Cloud Storage instead:
# PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
PATH_TO_WHL_FILE = '/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'


# Create and set your PipelineOptions.
options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
google_cloud_options.region = REGION
options.view_as(StandardOptions).runner = 'DataflowRunner'

setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]

# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 2

print("Generating stats...")
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION, output_path=GCS_STATS_OUTPUT_PATH, pipeline_options=options)
print("Stats generated!")

The code above starts a Dataflow job, but unfortunately the job fails with the following error:

apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute 'NumExamplesStatsGenerator' on <module 'tensorflow_data_validation.statistics.stats_impl' from '/usr/local/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py'>

I couldn't find anything similar on the internet. In case it helps, I have the following versions on my local machine (macOS):

Apache Beam version: 2.34.0
TensorFlow version: 2.6.2
TensorFlow Transform version: 1.4.0
TFDV version: 1.4.0

On the cloud, the job runs with the Apache Beam Python 3.8 SDK 2.34.0.

BONUS QUESTION: I also have a question about PATH_TO_WHL_FILE. I tried putting the wheel in a storage bucket, but Beam doesn't seem to be able to pick it up from there; it only works with a local path. That is a real problem, because it makes this code harder to distribute. What would be a good practice for distributing this wheel file?

Judging by its name, the attribute NumExamplesStatsGenerator is a generator, and it is not picklable.
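As background on how an error of this shape can arise, here is a minimal sketch using only the standard library. It relies on the fact that pickle stores a class by reference (module name plus class name), so loading fails if the class is missing from the module on the loading side. The module name fake_stats_impl and the class RemovedGenerator are hypothetical stand-ins, not TFDV code.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a library module; not part of TFDV.
fake_module = types.ModuleType("fake_stats_impl")
sys.modules["fake_stats_impl"] = fake_module


class RemovedGenerator:
    """Placeholder for a class that exists only in the 'old' library version."""


# Make pickle believe the class lives in the fake module.
RemovedGenerator.__module__ = "fake_stats_impl"
RemovedGenerator.__qualname__ = "RemovedGenerator"
fake_module.RemovedGenerator = RemovedGenerator

# "Submitting side": the instance is pickled while the class still exists.
payload = pickle.dumps(RemovedGenerator())

# "Worker side": simulate a library version where the class is gone.
del fake_module.RemovedGenerator

error = None
try:
    pickle.loads(payload)
except AttributeError as exc:
    # The unpickler looks the class up by name and cannot find it.
    error = exc

print(type(error).__name__, error)
```

The payload itself is valid; only the name lookup on the loading side fails, which is the same place the traceback in the question ends (find_class).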

However, I can no longer find that attribute in the current version of the module. A search indicates that the module did contain it in 1.4.0. So you may want to try a newer version of TFDV, so that your local install (1.4.0) matches the 1.7.0 wheel staged for the workers.
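One way to catch such a mismatch before submitting the job is to compare the locally installed version with the version encoded in the wheel filename. The sketch below assumes the standard wheel naming scheme (name-version-...-platform.whl); the helper names wheel_version and versions_match are made up for illustration.

```python
import os


def wheel_version(wheel_path):
    """Return the version field of a wheel filename (name-version-...whl)."""
    filename = os.path.basename(wheel_path)
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel filename: {filename}")
    # The second dash-separated field is the version.
    return parts[1]


def versions_match(installed_version, wheel_path):
    """True when the installed version equals the version of the staged wheel."""
    return installed_version == wheel_version(wheel_path)


PATH_TO_WHL_FILE = (
    "/Users/myuser/some-folder/"
    "tensorflow_data_validation-1.7.0-cp38-cp38-"
    "manylinux_2_12_x86_64.manylinux2010_x86_64.whl"
)

# Versions taken from the question: local TFDV 1.4.0, staged wheel 1.7.0.
print(wheel_version(PATH_TO_WHL_FILE))
print(versions_match("1.4.0", PATH_TO_WHL_FILE))
```

In a real script the installed version could come from importlib.metadata.version("tensorflow-data-validation") instead of a hard-coded string.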

PATH_TO_WHL_FILE points to a file that is staged and distributed to Dataflow for execution, so you can use a file on GCS.
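As a rough sketch of what that could look like, the helper below builds --extra_package flags for a GCS wheel path; the resulting list can be passed to PipelineOptions(flags=...). The helper name extra_package_flags is hypothetical, and the list of accepted suffixes is my assumption about what Beam's stager allows, so check it against your Beam version.

```python
# Assumed set of archive suffixes accepted for extra packages.
ALLOWED_SUFFIXES = (".whl", ".tar", ".tar.gz", ".zip")


def extra_package_flags(package_paths):
    """Build one --extra_package flag per package path (local or gs://)."""
    flags = []
    for path in package_paths:
        if not path.endswith(ALLOWED_SUFFIXES):
            raise ValueError(f"unsupported package file: {path}")
        flags.append(f"--extra_package={path}")
    return flags


GCS_WHL_FILE = (
    "gs://my-bucket/wheels/"
    "tensorflow_data_validation-1.7.0-cp38-cp38-"
    "manylinux_2_12_x86_64.manylinux2010_x86_64.whl"
)

flags = extra_package_flags([GCS_WHL_FILE])
print(flags)

# With Apache Beam installed, the flags could then be used like this:
#   from apache_beam.options.pipeline_options import PipelineOptions
#   options = PipelineOptions(flags=flags)
```

The machine that submits the job still needs read access to the bucket, because the package is fetched at submission time before being staged.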
