I'm preprocessing data for TensorFlow using Apache Beam. I'd like to choose the number of TFRecord shards based on the number of examples in my dataset. The relevant section of code is:
EXAMPLES_PER_SHARD = 5.0
num_tfexamples = tfexample_strs | "count tf examples" >> beam.combiners.Count.Globally()
num_shards = num_tfexamples | ("compute number of shards" >>
    beam.Map(lambda num_examples: int(math.ceil(num_examples / EXAMPLES_PER_SHARD))))
_ = tfexample_strs | ("output to tfrecords" >>
    beam.io.WriteToTFRecord(OUTPUT_DIR, num_shards=beam.pvalue.AsSingleton(num_shards)))
This fails with the stacktrace:
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/iobase.py", line 1011, in start_bundle
self.counter = random.randint(0, self.count - 1)
TypeError: unsupported operand type(s) for -: 'AsSingleton' and 'int' [while running 'output VALIDATION to tfrecords/Write/WriteImpl/ParDo(_RoundRobinKeyFn)']
I see these lines in the class definition of PTransform:
# By default, transforms don't have any side inputs.
side_inputs = ()
Is it possible to pass a side input to a PTransform such as WriteToTFRecord? Thanks for the help.
WriteToTFRecord does not support using a side input for num_shards. In theory nothing prevents it from doing so (and it is possible in the Java SDK); it is just not implemented in the Python SDK. Feel free to file a JIRA issue.
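Until that is supported, one possible workaround is to compute the shard count as a plain Python int before the write step is constructed, since the stack trace shows num_shards being used in ordinary integer arithmetic. The sketch below assumes the example count is already known at pipeline construction time (for instance from an earlier counting job); compute_num_shards is a hypothetical helper, not part of Beam, and the choice to return at least one shard is also an assumption.

```python
import math


def compute_num_shards(num_examples, examples_per_shard=5.0):
    """Return the shard count as a plain int (hypothetical helper, not a Beam API).

    Rounds up so that no shard holds more than examples_per_shard examples,
    and returns at least 1 so an empty dataset still gets one output file
    (an assumption made for this sketch).
    """
    shards = int(math.ceil(num_examples / float(examples_per_shard)))
    return max(1, shards)


# Assumed usage, with num_examples obtained before building the pipeline:
#   num_shards = compute_num_shards(num_examples, EXAMPLES_PER_SHARD)
#   _ = tfexample_strs | ("output to tfrecords" >>
#       beam.io.WriteToTFRecord(OUTPUT_DIR, num_shards=num_shards))
```

The key difference from the code in the question is that num_shards here is an int, not a beam.pvalue.AsSingleton wrapper, so the subtraction in iobase.py no longer raises a TypeError. The cost is that the count has to be computed outside the pipeline that does the writing.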