How to compare all the values of two keys within the same PCollection in Python?
I am new to Apache Beam/Dataflow. I am reading a BigQuery table in Apache Beam, and I want to group by two different columns and compare all the values for two different keys. I have created a tuple of two columns, (ID, Date), that acts as the key. Below is the sample data in the table:
ID Date P_id position
"abc" 2019-08-01 "rt56" 5
"abc" 2019-08-01 "rt57" 6
"abc" 2019-08-01 "rt58" 7
"abc" 2019-08-02 "rt56" 2
"abc" 2019-08-02 "rt57" 4
"abc" 2019-08-02 "rt58" 7
Now I want to compare the positions of the P_ids for the pairs ("abc", 2019-08-01) and ("abc", 2019-08-02). If the position of any P_id has changed, I want to add another column, "Status", to the table with the value True. So my new table should look like this:
ID Date P_id position Status
"abc" 2019-08-01 "rt56" 5 False (as this is first date)
"abc" 2019-08-01 "rt57" 6
"abc" 2019-08-01 "rt58" 7
"abc" 2019-08-02 "rt56" 2 True
"abc" 2019-08-02 "rt57" 4
"abc" 2019-08-02 "rt58" 7
I am trying it with the code below:
(
    p
    | "get_key_tuple" >> beam.ParDo(lambda element: (element["Id"], element["Date"]))
    | "group_by" >> beam.GroupByKey()
    | "compare_and_add_status" >> beam.ParDo(compare_pos)
)
But I don't know how to proceed with the function compare_pos().
It would be very helpful to get some ideas on how to compare the positions efficiently and create a new column for the status, considering that I have a very large table and lots of IDs.
Beam's GroupByKey takes a PCollection of 2-tuples and returns a PCollection where every element is a 2-tuple of the key and an (unordered) iterable of all values that were associated with that key. For example, if your original collection had the elements
(k1, v1)
(k1, v2)
(k1, v3)
(k2, v4)
the result of GroupByKey would be a PCollection with elements like
(k1, [v1, v3, v2])
(k2, [v4])
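To make the grouping semantics concrete, here is a minimal pure-Python sketch that mimics the shape of GroupByKey's output on a small in-memory list. It is only an emulation for illustration: the helper name group_by_key is hypothetical and not part of Beam, and in a real pipeline the order of the values within each group is not guaranteed.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Emulate GroupByKey on an in-memory list of (key, value) 2-tuples."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Beam makes no ordering promise for the grouped values;
    # this emulation simply keeps them in input order.
    return list(grouped.items())


pairs = [("k1", "v1"), ("k1", "v2"), ("k1", "v3"), ("k2", "v4")]
result = group_by_key(pairs)
```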
In your case, your keys and values themselves are tuples. So you could take your original collection and apply a Map(lambda elt: ((elt['Id'], elt['Date']), (elt['P_id'], elt['position']))), which would give you a PCollection with elements
("abc", 2019-08-01), ("rt56", 5)
("abc", 2019-08-01), ("rt57", 6)
("abc", 2019-08-01), ("rt58", 7)
("abc", 2019-08-02), ("rt56", 2)
("abc", 2019-08-02), ("rt57", 4)
("abc", 2019-08-02), ("rt58", 7)
which, upon applying GroupByKey, would become
("abc", 2019-08-01), [("rt56", 5), ("rt57", 6), ("rt58", 7)]
("abc", 2019-08-02), [("rt56", 2), ("rt57", 4), ("rt58", 7)]
At this point your compare_pos function could inspect all the (P_id, position) tuples corresponding to a given (ID, Date) pair and perform whatever logic is needed to emit what needs to be changed (with its corresponding key).
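Since each grouped element holds only one (ID, Date) pair, comparing two dates requires bringing them into the same element first. One possible way to do that, sketched below in plain Python, is to re-key each grouped element by ID alone, group again, and then walk the dates in order. This is an assumption about what the logic could look like, not something stated in the question: the helper names to_id_keyed and compare_pos_by_id are hypothetical, dates are assumed to sort chronologically (for example ISO strings), and all dates for one ID are assumed to fit in memory.

```python
def to_id_keyed(element):
    """Re-key one grouped element from (ID, Date) to ID alone."""
    (row_id, date), pid_positions = element
    return row_id, (date, dict(pid_positions))


def compare_pos_by_id(element):
    """For one ID, yield ((ID, Date), status) for each date in order."""
    row_id, dated_positions = element
    previous = None
    for date, positions in sorted(dated_positions, key=lambda dp: dp[0]):
        # The first date has nothing to compare against, so its status is False.
        status = previous is not None and positions != previous
        yield (row_id, date), status
        previous = positions


grouped_by_id_date = [
    (("abc", "2019-08-01"), [("rt56", 5), ("rt57", 6), ("rt58", 7)]),
    (("abc", "2019-08-02"), [("rt56", 2), ("rt57", 4), ("rt58", 7)]),
]

# Stand-in for a second GroupByKey, this time on ID only.
by_id = {}
for grouped_element in grouped_by_id_date:
    row_id, dated = to_id_keyed(grouped_element)
    by_id.setdefault(row_id, []).append(dated)

statuses = [s for item in by_id.items() for s in compare_pos_by_id(item)]
```

In a pipeline, these two functions would presumably be applied with beam.Map(to_id_keyed), a second beam.GroupByKey(), and beam.FlatMap(compare_pos_by_id). The status here is per date: it is True if any P_id position differs from the previous date.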
I might be interpreting the OP wrong, but if the suggestion by @robertwb doesn't work, perhaps try grouping by the following instead:
| "Create k, v tuple" >> beam.Map(
lambda elem: ((elem["P_id"], elem["ID"]), [elem["Date"], elem["position"]]))
| "Group by key" >> beam.GroupByKey()
This will output the following structure:
(('rt56', 'abc'), [['2019-08-01', 5], ['2019-08-02', 2]])
(('rt57', 'abc'), [['2019-08-01', 6], ['2019-08-02', 4]])
(('rt58', 'abc'), [['2019-08-01', 7], ['2019-08-02', 7]])
This should allow you to compare each element in the resulting PCollection individually, instead of cross-comparing across elements in the PCollection. This should probably fit the execution model of Beam better, if I'm correct.
This is based on my assumption that you want to check whether the position for a given P_id has changed between two dates.
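Under that assumption, the per-element comparison could look roughly like the plain-Python sketch below. The function name flag_position_changes and the output field names are hypothetical, and dates are assumed to sort chronologically (for example ISO strings). Each grouped element is handled on its own: its values are sorted by date, and every position is compared with the one from the previous date.

```python
def flag_position_changes(element):
    """Yield one output row per date for a single (P_id, ID) group."""
    (p_id, row_id), dated_positions = element
    previous = None
    # GroupByKey gives no ordering guarantee, so sort by date first.
    for date, position in sorted(dated_positions, key=lambda dp: dp[0]):
        yield {
            "ID": row_id,
            "Date": date,
            "P_id": p_id,
            "position": position,
            # The first date has no predecessor, so its Status is False.
            "Status": previous is not None and position != previous,
        }
        previous = position


changed = list(flag_position_changes(
    (("rt56", "abc"), [["2019-08-02", 2], ["2019-08-01", 5]])))
unchanged = list(flag_position_changes(
    (("rt58", "abc"), [["2019-08-01", 7], ["2019-08-02", 7]])))
```

Because the function yields several rows per input element, it would presumably be applied with beam.FlatMap(flag_position_changes) after the GroupByKey step. Note that this gives a status per P_id, which may differ from a per-date status if that is what the question intends.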