Apache Beam job (Python) using Tensorflow Transform is killed by Cloud Dataflow

I'm trying to run an Apache Beam job based on Tensorflow Transform on Dataflow, but it gets killed. Has anyone experienced this behaviour? This is a simple example with DirectRunner that runs fine locally but fails on Dataflow (with the runner changed accordingly):

import os
import csv
import datetime
import numpy as np

import tensorflow as tf
import tensorflow_transform as tft

from apache_beam.io import textio
from apache_beam.io import tfrecordio

from tensorflow_transform.beam import impl as beam_impl
from tensorflow_transform.beam import tft_beam_io 
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

import apache_beam as beam


NUMERIC_FEATURE_KEYS = ['feature_'+str(i) for i in range(2000)]


def _create_raw_metadata():
    column_schemas = {}
    for key in NUMERIC_FEATURE_KEYS:
        column_schemas[key] = dataset_schema.ColumnSchema(tf.float32, [], dataset_schema.FixedColumnRepresentation())

    raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema(column_schemas))

    return raw_data_metadata


def preprocessing_fn(inputs):
    outputs={}

    for key in NUMERIC_FEATURE_KEYS:
        outputs[key] = tft.scale_to_0_1(inputs[key])

    return outputs


def main():

    output_dir = '/tmp/tmp-folder-{}'.format(datetime.datetime.now().strftime('%Y%m%d%H%M%S'))

    RUNNER = 'DirectRunner'

    with beam.Pipeline(RUNNER) as p:
        with beam_impl.Context(temp_dir=output_dir):

            raw_data_metadata = _create_raw_metadata()
            _ = (raw_data_metadata | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(os.path.join(output_dir, 'rawdata_metadata'), pipeline=p))

            # 100 random examples with 2000 numeric features each.
            m = np.random.rand(100, 2000) * 100
            raw_data = (p
                    | 'CreateTestDataset' >> beam.Create([dict(zip(NUMERIC_FEATURE_KEYS, m[i, :])) for i in range(m.shape[0])]))

            raw_dataset = (raw_data, raw_data_metadata)

            transform_fn = (raw_dataset | 'Analyze' >> beam_impl.AnalyzeDataset(preprocessing_fn))
            _ = (transform_fn | 'WriteTransformFn' >> tft_beam_io.WriteTransformFn(output_dir))

            (transformed_data, transformed_metadata) = ((raw_dataset, transform_fn) | 'Transform' >> beam_impl.TransformDataset())

            transformed_data_coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
            _ = transformed_data | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(os.path.join(output_dir, 'train'), file_name_suffix='.gz', coder=transformed_data_coder)

if __name__ == '__main__':
    main()
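For reference, switching the example to Dataflow would look roughly like this (a minimal sketch; the project, bucket, and region values are placeholders, and on Dataflow the temp and output paths must be GCS locations rather than /tmp):

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical Dataflow options; all resource names below are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',            # placeholder project ID
    region='us-central1',                # placeholder region
    temp_location='gs://my-bucket/tmp',  # Dataflow requires a GCS temp path
)

with beam.Pipeline(options=options) as p:
    ...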

Also, my production code (not shown) fails with the message: The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.

Any hint?

The restriction on the pipeline description size is documented here: https://cloud.google.com/dataflow/quotas#limits

There is a way around that: instead of creating analysis stages for each tensor that goes into tft.scale_to_0_1, we can fuse them by first stacking the tensors together and then passing them into tft.scale_to_0_1 with elementwise=True. This way a single pair of min/max analyzers covers all 2000 features instead of one pair per feature, which keeps the job graph small.

The result will be the same, because the min and max are computed per 'column' instead of across the whole tensor.

In preprocessing_fn, this would look something like this:

def preprocessing_fn(inputs):
    outputs = {}
    # One stacked tensor -> a single analysis stage instead of one per feature.
    stacked = tf.stack([inputs[key] for key in NUMERIC_FEATURE_KEYS], axis=1)
    scaled_stacked = tft.scale_to_0_1(stacked, elementwise=True)
    for key, tensor in zip(NUMERIC_FEATURE_KEYS, tf.unstack(scaled_stacked, axis=1)):
        outputs[key] = tensor
    return outputs
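As a quick sanity check of the per-column claim, here is a small NumPy sketch (independent of tf.Transform) showing that scaling the stacked matrix column-wise gives the same result as scaling each feature independently:

import numpy as np

m = np.random.rand(100, 2000) * 100

# Column-wise scaling of the stacked matrix (what elementwise=True computes).
stacked_scaled = (m - m.min(axis=0)) / (m.max(axis=0) - m.min(axis=0))

# Scaling each feature column on its own (what the per-key version computes).
per_key_scaled = np.stack(
    [(m[:, i] - m[:, i].min()) / (m[:, i].max() - m[:, i].min())
     for i in range(m.shape[1])],
    axis=1)

assert np.allclose(stacked_scaled, per_key_scaled)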
