
Sorting Apache Beam wordcount_minimal output

I'm working through Beam's word count examples (in Python). I am able to run the example on DataflowRunner and get output.

The output files currently look like:

itself: 16
grey: 1
senses: 4
repair: 1
me: 228

Is there any way to sort a PCollection so that my output files are sorted in descending order of word frequency?

If there is no way to do this, what is the standard workflow for finding the most frequently occurring words? Would this be handled by a separate process after Beam reduces the data down to word counts?

In Beam the elements of a PCollection are unordered. I'd store the results in a database and perform the sorting there.
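
A minimal sketch of that idea, assuming the output lines keep the "word: count" format shown in the question. The in-memory SQLite database and the table and column names (word_counts, word, freq) are made up for illustration; any database would do.

```python
import sqlite3

# Sample lines in the "word: count" format from the question's output files.
lines = ["itself: 16", "grey: 1", "senses: 4", "repair: 1", "me: 228"]

def parse_line(line):
    # Split on the last ": " so the count is always the final field.
    word, freq = line.rsplit(": ", 1)
    return word, int(freq)

# In-memory database, used here only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (word TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO word_counts VALUES (?, ?)",
    [parse_line(line) for line in lines],
)

# Sort by frequency, descending; break ties alphabetically by word.
sorted_counts = conn.execute(
    "SELECT word, freq FROM word_counts ORDER BY freq DESC, word ASC"
).fetchall()
conn.close()

for word, freq in sorted_counts:
    print(f"{word}: {freq}")
```

In a real job the rows would come from the pipeline's output files or be written to the database by the pipeline itself, rather than from a hard-coded list.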

I'm not sure about your use case or whether you really need to sort within Beam, but one workaround is to give every row the same dummy key, apply GroupByKey, and sort the grouped data, as follows:

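import apache_beam as beam
from apache_beam.io import WriteToText

word_count_list = [
    ('itself', 16),
    ('grey', 1),
    ('senses', 4),
    ('repair', 1),
    ('me', 228),
]

def addKey(row):
    # Attach the same dummy key to every row so GroupByKey gathers them all.
    return (1, row)

def sortGroupedData(row):
    (keyNumber, sortData) = row
    # The grouped values are an iterable and may not be a list,
    # so use sorted() instead of list.sort().
    sortedData = sorted(sortData, key=lambda x: x[1], reverse=True)
    return sortedData[0:3]

with beam.Pipeline() as p:
    word_count = (p
                | 'CreateWordCountColl' >> beam.Create(word_count_list)
                | 'AddKey' >> beam.Map(addKey)
                | 'GroupByKey' >> beam.GroupByKey()
                | 'SortGroupedData' >> beam.Map(sortGroupedData)
                | 'Write' >> WriteToText('./sorting_results')
                )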

This writes the top 3 as a single element, a list:

[('me', 228), ('itself', 16), ('senses', 4)]

However, keep in mind that this gives up parallel processing of the dataset: every element is grouped under one key, so the sort runs on a single worker.
