I have two CSV files that need to be merged on a common column using Apache Beam (Python SDK). The files look like this:
users_v.csv
user_id,name,gender,age,address,date_joined
1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13
2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06
orders_v.csv
order_no,user_id,product_list,date_purchased
1000,1887,Cassava,2000-01-01
1001,838,"Calabash, Water Spinach",2000-01-01
I have tried the following, which appears to work (no errors), but I am unable to view the resulting PCollection using beam.Map(print):
import apache_beam as beam
with beam.Pipeline() as pipeline:
    orders = p | "Read orders" >> beam.io.ReadFromText("orders_v.csv")
    users = p | "Read users" >> beam.io.ReadFromText("users_v.csv")
    {"orders": orders, "users": users} | beam.CoGroupByKey() | beam.Map(print)
How can I print out the resulting PCollection?
There are a few mistakes in the code:
1 - You name the pipeline variable pipeline in the with statement, but then you use p as the pipeline variable.
2 - The dictionary before CoGroupByKey determines the names of the cogrouped values, but each input PCollection still needs to contain (key, value) pairs for the join to work.
3 - I guess you want to skip the header lines.
The function split_by_kv below is far from perfect, and you would need to improve it so that it extracts the key more reliably (since some of your fields may themselves contain a comma). The code should look something like this:
import apache_beam as beam
from apache_beam import CoGroupByKey, Map
from apache_beam.io import ReadFromText

def split_by_kv(element, index, delimiter=","):
    # Need a better approach here: a plain split breaks on commas inside quoted fields
    splitted = element.split(delimiter)
    # Return a (key, value) pair: the chosen column as key, the whole line as value
    return splitted[index], element

with beam.Pipeline() as p:
    orders = (p | "Read orders" >> ReadFromText("files/orders_v.csv", skip_header_lines=1)
                | "to KV order" >> Map(split_by_kv, index=1, delimiter=",")
             )
    users = (p | "Read users" >> ReadFromText("files/users_v.csv", skip_header_lines=1)
               | "to KV users" >> Map(split_by_kv, index=0, delimiter=",")
            )
    ({"orders": orders, "users": users} | CoGroupByKey()
                                        | Map(print)
    )
The output has the form (key, {"orders": values for key, "users": values for key}):
('1887', {'orders': ['1000,1887,Cassava,2000-01-01'], 'users': []})
('838', {'orders': ['1001,838,"Calabash, Water Spinach",2000-01-01'], 'users': []})
('1', {'orders': [], 'users': ['1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13']})
('2', {'orders': [], 'users': ['2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06']})
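As mentioned, split_by_kv splits on every comma, so any key column that sits to the right of a quoted field such as "Calabash, Water Spinach" would come out wrong. One possible improvement is to let Python's standard csv module parse each line. This is only a sketch: the name split_by_kv_csv is made up for illustration, and it assumes every record fits on a single line (no newlines inside quoted fields).

```python
import csv

def split_by_kv_csv(element, index):
    """Return (key, element), where key is the column at `index`.

    Hypothetical replacement for split_by_kv; assumes one CSV record per line.
    """
    # csv.reader respects double quotes, so a quoted field that
    # contains commas is kept together as a single column.
    fields = next(csv.reader([element]))
    return fields[index], element
```

It could be dropped into the pipeline in place of the original function, for example Map(split_by_kv_csv, index=1) for the orders and Map(split_by_kv_csv, index=0) for the users.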
Also, you may want to have a look at the new DataFrames API.
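If you stay with CoGroupByKey, note that the tuples printed above are still grouped per key rather than merged into rows. A rough sketch of how they could be flattened into an inner join follows. The function name inner_join is an assumption for illustration, and the dictionary keys "orders" and "users" are the ones used in the pipeline above.

```python
def inner_join(element):
    """Yield one (key, order_line, user_line) tuple per matching pair.

    `element` is one output of CoGroupByKey: (key, {"orders": [...], "users": [...]}).
    """
    key, grouped = element
    # Keys present on only one side produce no output, as in an inner join.
    for order in grouped["orders"]:
        for user in grouped["users"]:
            yield key, order, user
```

Since the function yields zero or more results per input, it would go into the pipeline with beam.FlatMap(inner_join) after CoGroupByKey() and before Map(print). With only the sample rows shown in the question, no user_id appears in both files, so nothing would be printed.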