I have a CSV from which I create a PCollection (Apache Beam, Python). Is it possible to add an incremental ID to each element of the PCollection?
pcoll = ["Sangeeta,24,Kolkata", "Akshay,26,Delhi", "Sahil,26,Kolkata"]
And what I want is:
pcoll = [ (1, "Sangeeta,24,Kolkata"), (2, "Akshay,26,Delhi"), (3, "Sahil,26,Kolkata")]
Sorry for such a basic question, but I have very little experience with Apache Beam.
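You can use beam.combiners.ToList() to collect all elements of your PCollection into a single list, then use enumerate() on that list to add an incremental ID. The IDs will start at 0, since that is the default for enumerate() in Python; pass start=1 to enumerate() if you want them to start at 1.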
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    runner='DirectRunner',
)
p = beam.Pipeline(options=beam_options)
process = (
    p
    | beam.Create(['Sangeeta,24,Kolkata', 'Akshay,26,Delhi', 'Sahil,26,Kolkata'])
    # Collect every element into one list (a PCollection with a single element).
    | 'Combine' >> beam.combiners.ToList()
    # Pair each element with its index; enumerate() starts at 0 by default.
    | 'Manipulate' >> beam.Map(lambda my_seq: list(enumerate(my_seq)))
    | 'Print' >> beam.Map(print)
)
result = p.run()
result.wait_until_finish()
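The code above prints a single list of (index, element) tuples, with the indices starting at 0. Keep in mind that a PCollection is unordered, so ToList() does not guarantee that the list follows the original order of the input.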
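To get the 1-based (id, element) tuples from the question as separate elements rather than one list, the same idea can be sketched with FlatMap instead of Map. This is only a sketch under a few assumptions: the helper names add_ids and run_pipeline are made up for illustration, the default runner is used, and the whole data set is assumed to fit in memory on one worker.

```python
def add_ids(rows, start=1):
    """Pair each row with a consecutive ID, beginning at `start`."""
    return list(enumerate(rows, start=start))


def run_pipeline(rows):
    # Hypothetical wrapper for illustration; Beam is imported here so that
    # add_ids can be tried out on its own without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | beam.Create(rows)
            # Gather everything into one list (not parallel, see caveats).
            | 'Combine' >> beam.combiners.ToList()
            # FlatMap emits each (id, row) tuple as its own element.
            | 'AddIds' >> beam.FlatMap(add_ids)
            | 'Print' >> beam.Map(print)
        )
```

Calling run_pipeline(['Sangeeta,24,Kolkata', 'Akshay,26,Delhi', 'Sahil,26,Kolkata']) should print one tuple per row. Which row receives which ID depends on the order ToList() happens to produce, and the order in which the tuples are printed is not guaranteed either.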
The main purpose of Beam and PCollections is to enable parallel processing. Putting an index on each element is by nature not parallel. You can do non-parallel processing within Beam (as shown by other answers) but this will not scale to larger data sets and you don't really need Beam to do it.
I suggest that you take a step back and look at the problem you are trying to solve: why do you want numeric indices with no gaps? There is likely a different way to solve it in parallel.
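If the real requirement turns out to be "some stable identifier per row" rather than gap-free numbers, one possible parallel-friendly alternative is to derive the ID from the row itself. The sketch below is an assumption about what might fit, not something taken from the answers here; with_content_id is a hypothetical helper name, and the 16-character truncation is an arbitrary choice.

```python
import hashlib


def with_content_id(row):
    """Return (id, row), where id is derived only from the row's content.

    No element needs to know about any other element, so this can run
    in parallel. Caveat: identical rows receive identical IDs.
    """
    digest = hashlib.sha256(row.encode('utf-8')).hexdigest()[:16]
    return (digest, row)
```

In a pipeline this would be applied per element with beam.Map(with_content_id), with no ToList() step needed. The IDs are neither numeric nor ordered, so this only helps if uniqueness per distinct row is what you actually need.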