
Using CombinePerKey in Google Cloud Dataflow Python

I'm trying to run a simple Dataflow Python pipeline that reads certain user events from BigQuery and produces a per-user event count.

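# Assumed import for the pre-Beam Dataflow Python SDK, where `df` is the usual alias.
import google.cloud.dataflow as df

p = df.Pipeline(argv=pipeline_args)
result_query = "..."
data = p | df.io.Read(df.io.BigQuerySource(query=result_query))
# Emit one (user_id, 1) pair per event so the ones can be summed per user.
user_events = data | df.Map(lambda x: (x['users_user_id'], 1))
user_event_counts = user_events | df.CombinePerKey(sum)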

Running this gives me an error:

TypeError: Expected tuple, got int [while running 'Map(<lambda at user_stats.py:...>)']

The data going into the CombinePerKey transform looks like this:

(u'55107178236374', 1)
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'2296845644499670', 1)
(u'2296845644499670', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)

If I instead calculate user_event_counts like this:

user_event_counts = (user_events|df.GroupByKey()|
    df.Map('count', lambda (user, ones): (user, sum(ones))))

then there are no errors and I get the result I expect.

Based on the docs, I would have expected similar behaviour from both approaches. I'm obviously missing something with respect to CombinePerKey, but I can't see what it is. Any tips appreciated!

I am guessing you are running a version of the SDK lower than 0.2.4. This is a bug in how we handle combining operations in some scenarios. The issue is fixed in the latest release of the SDK (v0.2.4): https://github.com/GoogleCloudPlatform/DataflowPythonSDK/releases/tag/v0.2.4

Sorry about that. Let us know if you still experience the issue with the latest release.
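For reference, here is a plain-Python sketch of why the two approaches in the question should agree once the bug is out of the way. It does not use the Dataflow SDK at all; `group_then_sum` and `combine_per_key` are hypothetical helper names that only mimic what GroupByKey followed by a sum, and CombinePerKey(sum), are expected to compute. The sample pairs are the ones shown in the question.

```python
from collections import defaultdict

# (user_id, 1) pairs copied from the question's sample data.
user_events = [
    ('55107178236374', 1),
    ('55107178236374', 1),
    ('55107178236374', 1),
    ('2296845644499670', 1),
    ('2296845644499670', 1),
    ('1489727796186326', 1),
    ('1489727796186326', 1),
    ('1489727796186326', 1),
    ('1489727796186326', 1),
]


def group_then_sum(pairs):
    """Mimic GroupByKey + Map: collect every value per key, then sum them."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


def combine_per_key(pairs, combine_fn):
    """Mimic CombinePerKey: fold values into a running partial result per key.

    combine_fn takes an iterable of values and returns one value, and it is
    re-applied to (partial result, new value), so it should be associative
    and commutative, as sum is.
    """
    partial = {}
    for key, value in pairs:
        if key in partial:
            partial[key] = combine_fn([partial[key], value])
        else:
            partial[key] = combine_fn([value])
    return partial


counts_grouped = group_then_sum(user_events)
counts_combined = combine_per_key(user_events, sum)
print(counts_combined)
```

Both helpers produce the same per-user counts for this data. The practical difference in a real pipeline is that a combiner can be applied to partial results before all values for a key are gathered, which is the code path the answer says was affected by the bug.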
