How to merge two files and then view the PCollection (Apache Beam)

I have two csv files which need to be merged on a common column using beam (Python SDK). The files look like below:

users_v.csv

user_id,name,gender,age,address,date_joined
1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13
2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06

orders_v.csv

order_no,user_id,product_list,date_purchased
1000,1887,Cassava,2000-01-01
1001,838,"Calabash, Water Spinach",2000-01-01

I have tried the following which appears to work (no errors) but I am unable to view the resulting PCollection using beam.Map(print) :

import apache_beam as beam

with beam.Pipeline() as pipeline:
  orders = p | "Read orders" >> beam.io.ReadFromText("orders_v.csv")
  users = p | "Read users" >> beam.io.ReadFromText("users_v.csv")
  {"orders": orders, "users": users} | beam.CoGroupByKey() | beam.Map(print)

How can I print out the resulting PCollection?

There are a few mistakes in the code:

1 - You name the pipeline variable pipeline in the with statement, but then use p.

2 - The dictionary passed to CoGroupByKey only sets the names of the cogrouped collections. Each input still has to be a PCollection of (key, value) pairs, so that there is a key to join on.

3 - I guess you want to skip the headers.

The code should look something like this. The function split_by_kv is far from perfect: it splits on every delimiter, so you'd need to improve it to extract the keys reliably, since some of your fields (product_list, for example) contain commas.

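import apache_beam as beam
from apache_beam import CoGroupByKey, Map
from apache_beam.io import ReadFromText


def split_by_kv(element, index, delimiter=", "):
    # Need a better approach here: a plain split breaks on quoted commas
    splitted = element.split(delimiter)
    # Return (key, original line) so CoGroupByKey has a key to join on
    return splitted[index], element


with beam.Pipeline() as p:
    # user_id is the second column (index 1) of orders_v.csv
    orders = (p | "Read orders" >> ReadFromText("files/orders_v.csv", skip_header_lines=1)
                | "to KV order" >> Map(split_by_kv, index=1, delimiter=",")
             )

    # user_id is the first column (index 0) of users_v.csv
    users = (p | "Read users" >> ReadFromText("files/users_v.csv", skip_header_lines=1)
               | "to KV users" >> Map(split_by_kv, index=0, delimiter=",")
            )

    ({"orders": orders, "users": users} | CoGroupByKey()
                                        | Map(print)
    )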
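
One possible way to address the "need a better approach" comment is to let Python's standard csv module parse each line instead of splitting on every comma. This is only a sketch: the name split_by_kv_csv is made up for illustration, and it assumes each element is a single CSV record with no newline inside a quoted field (ReadFromText reads line by line). It still returns (key, original line), so the printed output keeps the same shape.

```python
import csv


def split_by_kv_csv(element, index):
    """Return (key, original line), taking the key from column `index`.

    Hypothetical replacement for split_by_kv; not part of Apache Beam.
    """
    # csv.reader understands quoted fields, so a comma inside quotes
    # does not start a new column.
    fields = next(csv.reader([element]))
    return fields[index], element


line = '1001,838,"Calabash, Water Spinach",2000-01-01'

# A plain split cuts the quoted product list in two, so column 3 is wrong.
print(line.split(",")[3])

# The csv-based version sees four columns and returns the real date.
print(split_by_kv_csv(line, index=3)[0])
```

In the pipeline it would be used the same way as before, for example Map(split_by_kv_csv, index=1) for the orders.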

The pipeline prints one tuple per key, in the form (key, {"orders": order lines for that key, "users": user lines for that key}):

('1887', {'orders': ['1000,1887,Cassava,2000-01-01'], 'users': []})
('838', {'orders': ['1001,838,"Calabash, Water Spinach",2000-01-01'], 'users': []})
('1', {'orders': [], 'users': ['1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13']})
('2', {'orders': [], 'users': ['2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06']})
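
In the sample output every key has an empty list on one side, because none of the sample orders belongs to user 1 or 2. If what you want is one merged row per matching user and order (an inner join), the grouped result can be flattened in a further step. The following is a rough sketch: inner_join is a hypothetical helper, not a Beam function, and the order line in the example is invented so that the keys match.

```python
def inner_join(grouped):
    """Yield one (user_line, order_line) pair per matching user and order.

    `grouped` is one element produced by CoGroupByKey:
    (key, {"orders": [...], "users": [...]}).
    Hypothetical helper for illustration only.
    """
    _key, groups = grouped
    for user in groups["users"]:
        for order in groups["orders"]:
            yield user, order


# Invented element: an order that really belongs to user 1.
sample = (
    "1",
    {
        "orders": ["1000,1,Cassava,2000-01-01"],
        "users": ["1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13"],
    },
)

for pair in inner_join(sample):
    print(pair)
```

Because the function yields zero or more results per key, it would go into the pipeline with FlatMap rather than Map, for example CoGroupByKey() | beam.FlatMap(inner_join) | beam.Map(print). Keys that appear in only one file produce no output.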

Also, you may want to have a look at the new DataFrames API.
