How to get one element from a PCollection in Apache Beam

Consider a PCollection of dicts like this:

[{'id':'1','name':'Tom','country':'USA'},{'id':'2','name':'Oprah','country':'USA'}....]

I want to count the occurrences of each country. The result should look something like this:

{'USA':2, 'Tunisia':3, 'France':1}
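For reference, the desired result can be stated in plain Python. This is only a minimal in-memory sketch of the expected semantics, not a Beam pipeline; the function name `count_countries` and the three sample records are assumptions made for illustration.

```python
from collections import Counter


def count_countries(records):
    """Return a dict mapping each country to its number of occurrences."""
    return dict(Counter(record["country"] for record in records))


# Hypothetical sample records, shaped like the elements in the question.
records = [
    {"id": "1", "name": "Tom", "country": "USA"},
    {"id": "2", "name": "Oprah", "country": "USA"},
    {"id": "3", "name": "Marco", "country": "Italy"},
]

print(count_countries(records))  # {'USA': 2, 'Italy': 1}
```

A function like this is handy as a baseline to compare pipeline output against on small inputs.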

Take a look at beam.combiners.ToDict, which combines a PCollection of (key, value) pairs into a single dict.

Example:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions())

(p
 | "create pcoll" >> beam.Create([
       {'id': '1', 'name': 'Tom', 'country': 'USA'},
       {'id': '2', 'name': 'Oprah', 'country': 'USA'},
       {'id': '2', 'name': 'Oprah', 'country': 'Italy'}])
 # Keep only the country of each record.
 | "map" >> beam.Map(lambda x: x['country'])
 # Emit one (country, count) pair per distinct country.
 | "count" >> beam.combiners.Count.PerElement()
 # Collect all pairs into a single dict.
 | "toDict" >> beam.combiners.ToDict()
 | "print" >> beam.Map(print)
)

# Block until the pipeline has finished.
p.run().wait_until_finish()

# Result {'USA': 2, 'Italy': 1}

This is similar to the word count example. You can find a Python implementation here: https://beam.apache.org/get-started/wordcount-example/
