Can I pass side inputs to Apache Beam PTransforms?

I'm preprocessing data for TensorFlow using Apache Beam. I'd like to choose the number of TFRecord shards based on the number of examples in my dataset. The relevant section of code is:

EXAMPLES_PER_SHARD = 5.0
num_tfexamples = tfexample_strs | "count tf examples" >> beam.combiners.Count.Globally()
num_shards = num_tfexamples | ("compute number of shards" >>
                               beam.Map(lambda num_examples: int(math.ceil(num_examples / EXAMPLES_PER_SHARD))))
_ = tfexample_strs | ("output to tfrecords" >>
                      beam.io.WriteToTFRecord(OUTPUT_DIR, num_shards=beam.pvalue.AsSingleton(num_shards)))

This fails with the following stack trace:

File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/iobase.py", line 1011, in start_bundle
    self.counter = random.randint(0, self.count - 1)
TypeError: unsupported operand type(s) for -: 'AsSingleton' and 'int' [while running 'output VALIDATION to tfrecords/Write/WriteImpl/ParDo(_RoundRobinKeyFn)']

I see these lines in the class definition of PTransform:

# By default, transforms don't have any side inputs.
side_inputs = ()

Is it possible to pass a side input to PTransforms? Thanks for the help

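WriteToTFRecord does not support using a side input for num_shards. That is why the AsSingleton wrapper reaches the sharding code unresolved and fails when it is used as an integer. In theory nothing prevents WriteToTFRecord from supporting this (and in the Java SDK it is possible); it is just not implemented in the Python SDK. Feel free to file a JIRA.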
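
Since num_shards has to be a plain integer in the Python SDK, one possible workaround is to compute it before the pipeline is constructed. The snippet below is only a minimal sketch under that assumption: it supposes the example count is already known up front (for instance from an earlier pass over the data), the helper name compute_num_shards is hypothetical, and falling back to a single shard for an empty dataset is a choice made here rather than anything Beam prescribes.

```python
import math

EXAMPLES_PER_SHARD = 5.0  # same constant as in the question


def compute_num_shards(num_examples, examples_per_shard=EXAMPLES_PER_SHARD):
    """Hypothetical helper: return the shard count as a plain int."""
    if num_examples <= 0:
        # Assumption made for this sketch: always write at least one shard.
        return 1
    # float() keeps the division exact under Python 2 as well.
    return int(math.ceil(num_examples / float(examples_per_shard)))


# The resulting int could then be passed where the question used a side input:
#   beam.io.WriteToTFRecord(OUTPUT_DIR, num_shards=compute_num_shards(n))
# where n is the example count obtained before building the pipeline.
```

This only helps when the count is available at pipeline construction time; a count produced inside the same pipeline, as in the question, still cannot be fed into num_shards.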
