How to compare all the values of two keys within the same PCollection in Python?

I am new to Apache Beam/Dataflow. I am reading a BigQuery table in Apache Beam, and I want to group by two different columns and compare all the values for two different keys. I have created a tuple of two columns (ID, Date) that acts as the key. Below is sample data from the table:

  ID         Date        P_id    position
  "abc"    2019-08-01   "rt56"      5
  "abc"    2019-08-01   "rt57"      6
  "abc"    2019-08-01   "rt58"      7
  "abc"    2019-08-02   "rt56"      2 
  "abc"    2019-08-02   "rt57"      4
  "abc"    2019-08-02   "rt58"      7

Now I want to compare the positions of the P_ids for the pairs ("abc", 2019-08-01) and ("abc", 2019-08-02). If the position of any P_id has changed, I want to add another column, "Status", to the table with the value True. So my new table should look like this:

  ID         Date        P_id    position  Status
  "abc"    2019-08-01   "rt56"      5       False (as this is first date)
  "abc"    2019-08-01   "rt57"      6
  "abc"    2019-08-01   "rt58"      7
  "abc"    2019-08-02   "rt56"      2       True
  "abc"    2019-08-02   "rt57"      4
  "abc"    2019-08-02   "rt58"      7

I am trying it with the code below:

(
p 
| "get_key_tuple" >> beam.Map(lambda element: (element["Id"], element["Date"]))
| "group_by" >> beam.GroupByKey()
| "compare_and_add_status" >> beam.ParDo(compare_pos)
)

But I don't know how to proceed with the function compare_pos().

Any ideas on how to compare the positions efficiently and create a new column holding the status would be very helpful, given that I have a very large table and lots of IDs.

Beam's GroupByKey takes a PCollection of 2-tuples and returns a PCollection where every element is a 2-tuple of the key and an (unordered) iterable of all values that were associated with that key. For example, if your original collection had the elements

(k1, v1)
(k1, v2)
(k1, v3)
(k2, v4)

the result of GroupByKey would be a PCollection with elements like

(k1, [v1, v3, v2])
(k2, [v4])

In your case, your keys and values themselves are tuples. So you could take your original collection and apply a Map(lambda elt: ((elt['Id'], elt['Date']), (elt['P_id'], elt['position']))), which would give you a PCollection with the elements

  (("abc", "2019-08-01"), ("rt56", 5))
  (("abc", "2019-08-01"), ("rt57", 6))
  (("abc", "2019-08-01"), ("rt58", 7))
  (("abc", "2019-08-02"), ("rt56", 2))
  (("abc", "2019-08-02"), ("rt57", 4))
  (("abc", "2019-08-02"), ("rt58", 7))

which, upon applying GroupByKey, would become

  (("abc", "2019-08-01"), [("rt56", 5), ("rt57", 6), ("rt58", 7)])
  (("abc", "2019-08-02"), [("rt56", 2), ("rt57", 4), ("rt58", 7)])
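
The keying and grouping steps described above could look roughly like the sketch below. It assumes each row arrives as a dict whose keys are spelled as in the question's table header (ID, Date, P_id, position), so adjust them if your schema uses Id. The names ROWS, to_key_value and run are illustrative only, and beam.Create stands in for the BigQuery read.

```python
# Sample rows copied from the question; in a real pipeline these would
# come from the BigQuery read instead.
ROWS = [
    {"ID": "abc", "Date": "2019-08-01", "P_id": "rt56", "position": 5},
    {"ID": "abc", "Date": "2019-08-01", "P_id": "rt57", "position": 6},
    {"ID": "abc", "Date": "2019-08-01", "P_id": "rt58", "position": 7},
    {"ID": "abc", "Date": "2019-08-02", "P_id": "rt56", "position": 2},
    {"ID": "abc", "Date": "2019-08-02", "P_id": "rt57", "position": 4},
    {"ID": "abc", "Date": "2019-08-02", "P_id": "rt58", "position": 7},
]


def to_key_value(row):
    """Turn one row dict into a ((ID, Date), (P_id, position)) pair."""
    return (row["ID"], row["Date"]), (row["P_id"], row["position"])


def run(rows=ROWS):
    # Imported here so to_key_value stays usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "create_rows" >> beam.Create(rows)  # stand-in for the BigQuery read
            | "to_key_value" >> beam.Map(to_key_value)
            | "group_by" >> beam.GroupByKey()
            # The grouped values are unordered; sort only to get a stable printout.
            | "sort_values" >> beam.MapTuple(lambda key, values: (key, sorted(values)))
            | "print" >> beam.Map(print)
        )
```

Calling run() on the sample rows should print one element per (ID, Date) pair, in no guaranteed order.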

At this point your compare_pos function can inspect all the (P_id, position) tuples corresponding to a given (ID, Date) pair and perform whatever logic is needed to emit what needs to be changed (with its corresponding key).
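
Note that after grouping by (ID, Date), each element holds the positions for a single date only, so comparing two dates means bringing the groups for one ID together. One possible way to do that, which is an assumption on my part and not something spelled out above, is to re-key each grouped element by ID, group again, and compare consecutive dates. In the sketch below, rekey_by_id is a hypothetical helper, compare_pos is just one guess at the logic the question asks for, and dates are assumed to be ISO strings or date objects so that sorting them is chronological.

```python
def rekey_by_id(element):
    """((ID, Date), [(P_id, position), ...]) -> (ID, (Date, {P_id: position}))."""
    (row_id, date), pairs = element
    return row_id, (date, dict(pairs))


def compare_pos(element):
    """Yield (ID, Date, P_id, position, status) rows for one ID.

    status is False for the earliest date, and True for a later date when
    the P_id -> position mapping differs from the previous date's mapping.
    """
    row_id, dated_positions = element
    previous = None
    # Sorting materializes every date for this ID in memory.
    for date, positions in sorted(dated_positions, key=lambda item: item[0]):
        status = previous is not None and positions != previous
        for p_id, position in sorted(positions.items()):
            yield row_id, date, p_id, position, status
        previous = positions
```

In a pipeline this would follow the first GroupByKey as beam.Map(rekey_by_id), then another beam.GroupByKey(), then beam.FlatMap(compare_pos). Because whole mappings are compared, a P_id that appears or disappears between dates also counts as a change.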

I might be misinterpreting the OP, but if the suggestion by @robertwb doesn't work, perhaps try grouping by the following instead:

| "Create k, v tuple" >> beam.Map(
                    lambda elem: ((elem["P_id"], elem["ID"]), [elem["Date"], elem["position"]]))
| "Group by key" >> beam.GroupByKey()

This will output the following structure:

(('rt56', 'abc'), [['2019-08-01', 5], ['2019-08-02', 2]])
(('rt57', 'abc'), [['2019-08-01', 6], ['2019-08-02', 4]])
(('rt58', 'abc'), [['2019-08-01', 7], ['2019-08-02', 7]])

This should allow you to compare each element in the resulting PCollection individually, instead of cross-comparing across elements in the PCollection. If I'm correct, this should fit Beam's execution model better.

This is based on my assumption that you want to check whether the position for a given P_id has changed between two dates.
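
Under that assumption, the per-element comparison could be sketched as follows. The function name flag_position_changes and the output dict layout are hypothetical. It expects one grouped element of the shape shown above and treats the earliest date as unchanged.

```python
def flag_position_changes(element):
    """((P_id, ID), [[Date, position], ...]) -> row dicts with a "status" field."""
    (p_id, row_id), dated = element
    # Grouped values are unordered, so sort by date first
    # (assumes ISO date strings or date objects).
    ordered = sorted(dated, key=lambda pair: pair[0])
    for index, (date, position) in enumerate(ordered):
        # False on the first date; True when the position differs
        # from the one on the previous date.
        changed = index > 0 and position != ordered[index - 1][1]
        yield {
            "ID": row_id,
            "Date": date,
            "P_id": p_id,
            "position": position,
            "status": changed,
        }
```

It could be applied after the "Group by key" step with beam.FlatMap(flag_position_changes). Note that this gives one status per P_id and date, which differs from the question's table, where the status is set once per (ID, Date) group.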
