Sorting Apache Beam wordcount_minimal output

Question

I'm working through Beam's word count examples (in Python). I am able to run the example on DataflowRunner and get output.

The output files currently look like:

itself: 16
grey: 1
senses: 4
repair: 1
me: 228

Is there any way to sort a PCollection so that my output files are sorted in descending order of word frequency?

If there is no way to do this, what is the standard workflow for finding the most frequently occurring words? Would this be handled by a separate process after Beam reduces the data down to word counts?

Answer 1

In Beam the elements of a PCollection are unordered. I'd store the results in a database and perform the sorting there.
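If the word counts are small enough to fit in memory, the sorting could also happen in a separate step after the pipeline has written its files. The following is only a minimal sketch in plain Python, assuming each output line has the `word: count` shape shown in the question; `parse_line` and `sort_by_frequency` are hypothetical helper names, not part of Beam.

```python
def parse_line(line):
    """Split a 'word: count' line into a (word, count) tuple."""
    # rpartition splits on the last ': ', so the count is always the final field.
    word, _, count = line.rpartition(": ")
    return word, int(count)


def sort_by_frequency(lines):
    """Return (word, count) pairs sorted by count, highest first."""
    pairs = [parse_line(line) for line in lines if line.strip()]
    # sorted() is stable, so words with equal counts keep their input order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Sample lines copied from the question; real input would be read
    # from the files the pipeline wrote.
    sample = ["itself: 16", "grey: 1", "senses: 4", "repair: 1", "me: 228"]
    for word, count in sort_by_frequency(sample):
        print(f"{word}: {count}")
```

The same idea applies to a database: load the pairs and let an ORDER BY clause do the sorting.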

I'm not sure about your use case or whether sorting within Beam is really necessary, but one workaround is to map every row to the same dummy key, apply GroupByKey, and sort the grouped data, as follows:

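import apache_beam as beam
from apache_beam.io import WriteToText

word_count_list = [
    ('itself', 16),
    ('grey', 1),
    ('senses', 4),
    ('repair', 1),
    ('me', 228),
]

def addKey(row):
    # Attach the same dummy key to every row so GroupByKey collects them all together.
    return (1, row)

def sortGroupedData(row):
    (keyNumber, sortData) = row
    # The grouped values are an iterable that may not be a list on every runner,
    # so use sorted() instead of list.sort().
    sortedData = sorted(sortData, key=lambda x: x[1], reverse=True)
    return sortedData[0:3]

with beam.Pipeline() as p:
    word_count = (p
                  | 'CreateWordCountColl' >> beam.Create(word_count_list)
                  | 'AddKey' >> beam.Map(addKey)
                  | 'GroupByKey' >> beam.GroupByKey()
                  | 'SortGroupedData' >> beam.Map(sortGroupedData)
                  | 'Write' >> WriteToText('./sorting_results')
                  )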

This writes the top 3 entries as a single list on one row:

[('me', 228), ('itself', 16), ('senses', 4)]

Keep in mind, however, that grouping everything under a single key gives up parallel processing of the dataset for that step.

solution1
0 ACCEPTED 2019-05-07 10:47:25