Using CombinePerKey in Google Cloud Dataflow Python
I'm trying to run a simple Dataflow Python pipeline that gets certain user events from BigQuery and produces a per-user event count.
p = df.Pipeline(argv=pipeline_args)
result_query = "..."
data = p | df.io.Read(df.io.BigQuerySource(query=result_query))
user_events = data | df.Map(lambda x: (x['users_user_id'], 1))
user_event_counts = user_events | df.CombinePerKey(sum)
Running this gives me an error:
TypeError: Expected tuple, got int [while running 'Map(<lambda at user_stats.py:...>)']
Data before the CombinePerKey transform is in this form:
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'2296845644499670', 1)
(u'2296845644499670', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
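For reference, CombinePerKey(sum) should simply sum the 1s per key, so on those pairs the expected per-user counts can be sketched in plain Python (pipeline machinery aside):

```python
from collections import defaultdict

# The (user_id, 1) pairs shown above, as plain Python data.
pairs = (
    [(u'55107178236374', 1)] * 3
    + [(u'2296845644499670', 1)] * 2
    + [(u'1489727796186326', 1)] * 4
)

# CombinePerKey(sum) semantics: sum the values for each key.
counts = defaultdict(int)
for user, one in pairs:
    counts[user] += one

print(dict(counts))
# {'55107178236374': 3, '2296845644499670': 2, '1489727796186326': 4}
```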
If I instead calculate user_event_counts with this:
user_event_counts = (user_events|df.GroupByKey()|
df.Map('count', lambda (user, ones): (user, sum(ones))))
then there are no errors and I get the result I expect.
Based on the docs I would have expected similar behaviour from both approaches. I am obviously missing something with respect to CombinePerKey, but I can't see what it is. Any tips appreciated!
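(For what it's worth, the two approaches should be semantically equivalent; a plain-Python emulation of both, outside any pipeline and on made-up keys, shows they agree, which suggests the TypeError is an SDK issue rather than a modelling mistake.)

```python
from itertools import groupby
from operator import itemgetter

pairs = [('a', 1), ('b', 1), ('a', 1), ('a', 1), ('b', 1)]

# Path 1 - CombinePerKey(sum) semantics: fold the values per key directly.
combined = {}
for key, val in pairs:
    combined[key] = combined.get(key, 0) + val

# Path 2 - GroupByKey then Map: collect each key's values, then sum them.
grouped = {
    key: [v for _, v in group]
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
}
summed = {key: sum(ones) for key, ones in grouped.items()}

assert combined == summed == {'a': 3, 'b': 2}
```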
I am guessing you are running a version of the SDK lower than 0.2.4. This is a bug in how we handle combining operations in some scenarios. The issue is fixed with the latest release of the SDK (v0.2.4): https://github.com/GoogleCloudPlatform/DataflowPythonSDK/releases/tag/v0.2.4 Sorry about that. Let us know if you still experience the issue with the latest release.