
How to compare all the values of two keys within the same PCollection in Python?

I am new to Apache Beam/Dataflow. I am reading a BigQuery table in Apache Beam, and I want to group by two columns and compare all the values for two different keys. I have created a tuple of the two columns (ID, Date) that acts as the key. Below is sample data from the table:

  ID         Date        P_id    position
  "abc"    2019-08-01   "rt56"      5
  "abc"    2019-08-01   "rt57"      6
  "abc"    2019-08-01   "rt58"      7
  "abc"    2019-08-02   "rt56"      2 
  "abc"    2019-08-02   "rt57"      4
  "abc"    2019-08-02   "rt58"      7

Now I want to compare the positions of the P_ids for the pairs ("abc", 2019-08-01) and ("abc", 2019-08-02). If the position of any P_id has changed, I want to add another column, "Status", set to True. So my new table should look like this:

  ID         Date        P_id    position  Status
  "abc"    2019-08-01   "rt56"      5       False (as this is first date)
  "abc"    2019-08-01   "rt57"      6
  "abc"    2019-08-01   "rt58"      7
  "abc"    2019-08-02   "rt56"      2       True
  "abc"    2019-08-02   "rt57"      4
  "abc"    2019-08-02   "rt58"      7

I am trying it with the code below:

(
p
| "get_key_tuple" >> beam.ParDo(lambda element: (element["Id"], element["Date"]))
| "group_by" >> beam.GroupByKey()
| "compare_and_add_status" >> beam.ParDo(compare_pos)
)

But I don't know how I should proceed with the function compare_pos().

It would be very helpful to get some ideas on how I can efficiently compare the positions and create a new column for the status, considering that I have a very large table and lots of IDs.

Beam's GroupByKey takes a PCollection of 2-tuples and returns a PCollection where every element is a 2-tuple of the key and an (unordered) iterable of all values that were associated with that key. For example, if your original collection had the elements

(k1, v1)
(k1, v2)
(k1, v3)
(k2, v4)

the result of GroupByKey would be a PCollection with elements like

(k1, [v1, v3, v2])
(k2, [v4])
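
To make the shape of the result concrete, the grouping can be modelled in plain Python. This is only a local illustration of what comes out, not how Beam executes GroupByKey, and the helper name group_by_key is made up for this sketch. Unlike this model, Beam makes no promise about the order of the grouped values.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Plain-Python model of the output shape of GroupByKey (not Beam code)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        # Every value is collected under the key it arrived with.
        grouped[key].append(value)
    return dict(grouped)


pairs = [("k1", "v1"), ("k1", "v2"), ("k1", "v3"), ("k2", "v4")]
result = group_by_key(pairs)
```

Here result maps k1 to its three values and k2 to its single value, mirroring the two grouped elements shown above.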

In your case, your keys and values themselves are tuples. So you could take your original collection and apply a Map(lambda elt: ((elt['Id'], elt['Date']), (elt['P_id'], elt['position']))) which would give you a PCollection with elements

  ("abc", 2019-08-01),   ("rt56", 5)
  ("abc", 2019-08-01),   ("rt57", 6)
  ("abc", 2019-08-01),   ("rt58", 7)
  ("abc", 2019-08-02),   ("rt56", 2)
  ("abc", 2019-08-02),   ("rt57", 4)
  ("abc", 2019-08-02),   ("rt58", 7)

which, upon applying GroupByKey would become

  ("abc", 2019-08-01),   [("rt56", 5), ("rt57", 6), ("rt58", 7)]
  ("abc", 2019-08-02),   [("rt56", 2), ("rt57", 4), ("rt58", 7)]

At this point your compare_pos function could inspect all the P_id, position tuples corresponding to a given ID, Date pair and perform whatever logic is needed to emit what needs to be changed (with its corresponding key).
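
As a rough sketch of what compare_pos could look like: each element grouped by (ID, Date) only holds a single date, so one possible way to compare consecutive dates is to key by ID alone and carry the date in the value, i.e. (ID, (Date, P_id, position)). That keying is an assumption of this sketch and not something stated above. The function also assumes the dates sort chronologically (ISO strings or date objects) and that all rows for one ID fit in memory on a worker.

```python
from collections import defaultdict


def compare_pos(element):
    """Yield one row per input row with a per-date 'Status' flag.

    Assumes `element` is (ID, iterable of (Date, P_id, position)),
    i.e. the output of a GroupByKey keyed by ID alone.
    """
    record_id, values = element

    # Build {date: {p_id: position}} for this ID.
    by_date = defaultdict(dict)
    for date, p_id, position in values:
        by_date[date][p_id] = position

    previous = None
    # Grouped values arrive unordered, so sort the dates explicitly.
    for date in sorted(by_date):
        current = by_date[date]
        # The first date has nothing to compare against, so it is False.
        # Otherwise the flag is True if any position differs from the
        # previous date, or if the set of P_ids itself differs.
        status = previous is not None and current != previous
        for p_id, position in sorted(current.items()):
            yield {"ID": record_id, "Date": date, "P_id": p_id,
                   "position": position, "Status": status}
        previous = current
```

In a pipeline this could be wired up as beam.Map(lambda r: (r["ID"], (r["Date"], r["P_id"], r["position"]))), followed by beam.GroupByKey() and beam.FlatMap(compare_pos). FlatMap is used because the function yields several rows per group.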

I might be interpreting OP wrong, but if the suggestion by @robertwb doesn't work, try perhaps grouping by the following instead:

| "Create k, v tuple" >> beam.Map(
    lambda elem: ((elem["P_id"], elem["ID"]), [elem["Date"], elem["position"]]))
| "Group by key" >> beam.GroupByKey()

Which will output the following structure:

(('rt56', 'abc'), [['2019-08-01', 5], ['2019-08-02', 2]])
(('rt57', 'abc'), [['2019-08-01', 6], ['2019-08-02', 4]])
(('rt58', 'abc'), [['2019-08-01', 7], ['2019-08-02', 7]])

This should let you compare the values inside each element of the resulting PCollection individually, instead of comparing across elements of the PCollection. If I'm correct, that should also fit Beam's execution model better.

This is based on my assumption that you want to check if the position for a given P_id has changed between two dates.
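
Under that assumption, a per-element comparison could look roughly like the sketch below. The function name flag_position_changes and the output row layout are made up for illustration. It expects elements shaped like the grouped output above and assumes the dates sort chronologically.

```python
def flag_position_changes(element):
    """Yield one row per date, flagging whether the position changed.

    Assumes `element` is ((P_id, ID), iterable of [Date, position]).
    """
    (p_id, record_id), date_positions = element

    previous = None
    # Grouped values arrive unordered, so sort them by date first.
    for date, position in sorted(date_positions, key=lambda dp: dp[0]):
        # The earliest date has no predecessor, so its status is False.
        status = previous is not None and position != previous
        yield {"ID": record_id, "Date": date, "P_id": p_id,
               "position": position, "Status": status}
        previous = position
```

It could be applied right after the GroupByKey step with beam.FlatMap(flag_position_changes). Note that this flags each P_id separately, so rt58 (position 7 on both dates) would stay False while rt56 and rt57 would be True on 2019-08-02.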
