
Using CombinePerKey in Google Cloud Dataflow Python

I'm trying to run a simple Dataflow Python pipeline that gets certain user events from BigQuery and produces a per-user event count.

# Pre-Beam Dataflow Python SDK, imported as:
import google.cloud.dataflow as df

p = df.Pipeline(argv=pipeline_args)
result_query = "..."
data = p | df.io.Read(df.io.BigQuerySource(query=result_query))
user_events = data | df.Map(lambda x: (x['users_user_id'], 1))
user_event_counts = user_events | df.CombinePerKey(sum)

Running this gives me an error:

TypeError: Expected tuple, got int [while running 'Map(<lambda at user_stats.py:...>)']

Data before the CombinePerKey transform is in this form:

(u'55107178236374', 1)
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'2296845644499670', 1)
(u'2296845644499670', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)

If I instead calculate user_event_counts like this:

user_event_counts = (user_events|df.GroupByKey()|
    df.Map('count', lambda (user, ones): (user, sum(ones))))

then there are no errors and I get the result I expect.
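The two approaches are meant to be semantically equivalent: CombinePerKey(fn) behaves like GroupByKey followed by applying fn to each group's list of values. A minimal plain-Python sketch of that equivalence, using the sample data above (the function names here are illustrative, not part of the SDK):

```python
from collections import defaultdict

def group_by_key(pairs):
    # Collect every value sharing a key, like df.GroupByKey()
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

def combine_per_key(pairs, fn):
    # CombinePerKey(fn) == GroupByKey() then fn over each value list
    return [(key, fn(values)) for key, values in group_by_key(pairs)]

events = [('55107178236374', 1)] * 3 + [('2296845644499670', 1)] * 2
print(combine_per_key(events, sum))
# [('55107178236374', 3), ('2296845644499670', 2)]
```

In the real SDK, CombinePerKey can additionally combine values incrementally before the shuffle, which is why it is preferred over GroupByKey plus a Map when the combining function allows it.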

Based on the docs, I would have expected similar behaviour from both approaches. I'm obviously missing something with respect to CombinePerKey, but I can't see what it is. Any tips appreciated!

I am guessing you are running a version of the SDK lower than 0.2.4. This is a bug in how we handle combining operations in some scenarios. The issue is fixed in the latest release of the SDK (v0.2.4): https://github.com/GoogleCloudPlatform/DataflowPythonSDK/releases/tag/v0.2.4 Sorry about that. Let us know if you still experience the issue with the latest release.
