
Add incremental index to a PCollection?

I have a CSV from which I create a PCollection (Apache Beam, Python). Is it possible to add an incremental ID to each element of the PCollection?

pcoll = ["Sangeeta,24,Kolkata", "Akshay,26,Delhi", "Sahil,26,Kolkata"]

And what I want is:

pcoll = [ (1, "Sangeeta,24,Kolkata"), (2, "Akshay,26,Delhi"), (3, "Sahil,26,Kolkata")]

Sorry for such a basic question, but I have very little experience with Apache Beam.

You can use beam.combiners.ToList() to gather the whole PCollection into a single list, and then call enumerate() on that list to add an incremental ID. The IDs will start at 0, since that is the default for enumerate() in Python; pass start=1 to enumerate() if you want them to start at 1.

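import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    runner='DirectRunner',
)

p = beam.Pipeline(options=beam_options)

process = (p | beam.Create(['Sangeeta,24,Kolkata', 'Akshay,26,Delhi', 'Sahil,26,Kolkata'])
           # Gather every element into one in-memory list (a single output element).
           | 'Combine' >> beam.combiners.ToList()
           # Pair each entry with its position in the list, starting at 0.
           | 'Manipulate' >> beam.Map(lambda my_seq: list(enumerate(my_seq)))
           | 'Print' >> beam.Map(print)
          )

# Run the pipeline and block until it finishes.
result = p.run()
result.wait_until_finish()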

The code above prints a single list of (index, element) tuples:

[Image: screenshot of the printed output]
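If you want what the question asks for, 1-based IDs with each (id, element) tuple as its own element of the PCollection, here is a minimal sketch of one way to do it. The helper name index_elements and the run() wrapper are only illustrative, not part of Beam. Beam is imported inside run() so the helper can also be tried without Beam installed. Keep in mind that Beam does not guarantee the order of the list produced by ToList(), so the IDs are gapless but may not follow the order of the input file.

```python
def index_elements(lines, start=1):
    """Pair each entry of an in-memory list with a sequential ID.

    Hypothetical helper for illustration; `start` defaults to 1 to match
    the output requested in the question.
    """
    return list(enumerate(lines, start=start))


def run():
    # Imported here so index_elements() can be used without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | 'Create' >> beam.Create(
                ['Sangeeta,24,Kolkata', 'Akshay,26,Delhi', 'Sahil,26,Kolkata'])
            # Collect everything into one list (not parallel, small inputs only).
            | 'ToList' >> beam.combiners.ToList()
            # FlatMap emits each (id, element) tuple as a separate element.
            | 'Index' >> beam.FlatMap(index_elements)
            | 'Print' >> beam.Map(print)
        )
```

Calling run() executes the pipeline on the default runner. Using FlatMap instead of Map is what turns the single list back into individual elements.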

The main purpose of Beam and PCollections is to enable parallel processing. Putting an index on each element is by nature not parallel. You can do non-parallel processing within Beam (as shown by other answers) but this will not scale to larger data sets and you don't really need Beam to do it.

I suggest that you take a step back to the problem you are trying to solve - why do you want numeric indices with no gaps? There is likely a different way to solve it in parallel.
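As one possible illustration of such an alternative: if all you really need is a unique key per element rather than consecutive numbers, each element can be tagged independently, which stays parallel. This is only a sketch under that assumption. The helper name with_unique_id is hypothetical, and the IDs are random UUIDs, so they are neither ordered nor reproducible between runs.

```python
import uuid


def with_unique_id(line):
    """Attach a random unique ID to one element.

    Hypothetical helper for illustration; needs no knowledge of any
    other element, so it can run in parallel.
    """
    return (uuid.uuid4().hex, line)


def run():
    # Imported here so with_unique_id() can be used without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | 'Create' >> beam.Create(
                ['Sangeeta,24,Kolkata', 'Akshay,26,Delhi', 'Sahil,26,Kolkata'])
            # Each element gets its ID on its own, with no ToList() step.
            | 'AddId' >> beam.Map(with_unique_id)
            | 'Print' >> beam.Map(print)
        )
```

Whether this fits depends on why you need the index in the first place; if you need gapless numbers, it does not help.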

