Add incremental index to a PCollection?

Question

I have a CSV file from which I create a PCollection (Apache Beam Python). Is it possible to add an incremental ID to each element of the PCollection?

pcoll = ["Sangeeta,24,Kolkata", "Akshay,26,Delhi", "Sahil,26,Kolkata"]

And what I want is:

pcoll = [ (1, "Sangeeta,24,Kolkata"), (2, "Akshay,26,Delhi"), (3, "Sahil,26,Kolkata")]

Sorry for such a basic question, but I have very little experience with Apache Beam.

Answer 1

You can use beam.combiners.ToList() to gather the whole PCollection into a single list, then apply enumerate() to that list to add an incremental ID. The IDs will start at 0, since that is the default starting index of enumerate() in Python.

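import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    runner='DirectRunner',
)

p = beam.Pipeline(options=beam_options)

process = (p | beam.Create(['Sangeeta,24,Kolkata', 'Akshay,26,Delhi', 'Sahil,26,Kolkata'])
           # Collect every element into one list (a single, non-parallel value)
           | 'Combine' >> beam.combiners.ToList()
           # Pair each element of that list with its position, starting at 0
           | 'Manipulate' >> beam.Map(lambda my_seq: list(enumerate(my_seq)))
           | 'Print' >> beam.Map(print)
          )

result = p.run()
# Block until the pipeline has finished
result.wait_until_finish()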

The code above prints a single list of (index, element) tuples, with indices starting at 0.
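To match the output asked for in the question (IDs starting at 1, one tuple per element), the same idea can be adjusted slightly. This is only a sketch: add_index and run_indexed_pipeline are names made up here, and Beam does not guarantee the order of elements in a PCollection, so the order inside the combined list may differ from the input order on some runners.

```python
def add_index(elements, start=1):
    """Pair each element with a running index, beginning at `start`."""
    return list(enumerate(elements, start=start))


def run_indexed_pipeline(lines):
    # Hypothetical helper. Beam is imported here so that add_index
    # can be tried on a plain list without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'Create' >> beam.Create(lines)
         # One list holding every element (not parallel)
         | 'Combine' >> beam.combiners.ToList()
         # FlatMap emits each (index, element) tuple as its own element
         | 'Index' >> beam.FlatMap(add_index)
         | 'Print' >> beam.Map(print))
```

Calling run_indexed_pipeline(['Sangeeta,24,Kolkata', 'Akshay,26,Delhi', 'Sahil,26,Kolkata']) should print one tuple per line rather than a single list.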

Answer 2

The main purpose of Beam and PCollections is to enable parallel processing. Putting an index on each element is by nature not parallel. You can do non-parallel processing within Beam (as shown by other answers) but this will not scale to larger data sets and you don't really need Beam to do it.

I suggest that you take a step back to the problem you are trying to solve - why do you want numeric indices with no gaps? There is likely a different way to solve it in parallel.
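As one illustration of a parallel-friendly alternative, assuming you only need a stable identifier per row and not consecutive numbers, each element can get an ID derived from its own content. This is a sketch and not the only option: with_content_id and run_id_pipeline are hypothetical names, and identical rows will get the same ID.

```python
import hashlib


def with_content_id(line):
    """Return (id, line), where id is a short hash of the line's content."""
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    # Keep the first 16 hex characters to have a shorter identifier
    return (digest[:16], line)


def run_id_pipeline(lines):
    # Hypothetical helper. Beam is imported here so that with_content_id
    # can be tried on plain strings without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'Create' >> beam.Create(lines)
         # Each element is handled independently, so this step can run in parallel
         | 'AddId' >> beam.Map(with_content_id)
         | 'Print' >> beam.Map(print))
```

Because the ID depends only on the element itself, no step needs to see the whole data set, and a retried element gets the same ID again.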

2 answers

solution1
0 ACCEPTED 2022-03-02 08:10:24

solution2
0 2022-03-04 23:51:00
