
How to merge two files and then view the PCollection (Apache Beam)

I have two CSV files that I need to merge on a common column using Beam (Python SDK). The files look like this:

users_v.csv

user_id,name,gender,age,address,date_joined
1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13
2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06

orders_v.csv

order_no,user_id,product_list,date_purchased
1000,1887,Cassava,2000-01-01
1001,838,"Calabash, Water Spinach",2000-01-01

I tried the following, which seems to work (no errors), but I can't view the resulting PCollection using beam.Map(print):

import apache_beam as beam

with beam.Pipeline() as pipeline:
  orders = p | "Read orders" >> beam.io.ReadFromText("orders_v.csv")
  users = p | "Read users" >> beam.io.ReadFromText("users_v.csv")
  {"orders": orders, "users": users} | beam.CoGroupByKey() | beam.Map(print)

How can I print the resulting PCollection?

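There are a few errors in the code:

1 - You use pipeline in the with statement, but then use p as the pipeline variable.

2 - The dictionary before CoGroupByKey determines the names of the cogrouped inputs, but each input still needs to be a collection of (key, value) pairs to join on.

3 - I'm guessing you want to skip the header.

The code should look something like this. The function split_by_kv is far from perfect, and you will need to improve it so that it retrieves the key more reliably (since some of your fields may contain a comma).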

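import apache_beam as beam
from apache_beam import CoGroupByKey, Map
from apache_beam.io import ReadFromText


def split_by_kv(element, index, delimiter=", "):
    # Need a better approach here: a plain split breaks on quoted fields
    splitted = element.split(delimiter)
    # Return (key, whole line) so CoGroupByKey has a key to join on
    return splitted[index], element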


with beam.Pipeline() as p:
    orders = (p | "Read orders" >> ReadFromText("files/orders_v.csv", skip_header_lines=1)
                | "to KV order" >> Map(split_by_kv, index=1, delimiter=",")
             )
    
    users = (p | "Read users" >> ReadFromText("files/users_v.csv", skip_header_lines=1)
               | "to KV users" >> Map(split_by_kv, index=0, delimiter=",")
            )
    
    ({"orders": orders, "users": users} | CoGroupByKey() 
                                        | Map(print)
    )
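Since split_by_kv is flagged as needing improvement, here is a minimal sketch of one possible replacement that parses each line with Python's standard csv module instead of str.split. It assumes every line read by ReadFromText is one complete CSV record with double-quoted fields (no embedded newlines); the variable names at the bottom are only for illustration.

```python
import csv


def split_by_kv(element, index, delimiter=","):
    """Return (key, original_line), parsing the line as one CSV record."""
    # csv.reader honours double-quoted fields, so the comma inside
    # "Calabash, Water Spinach" does not start a new column.
    fields = next(csv.reader([element], delimiter=delimiter))
    return fields[index], element


# Illustration with the order line from the question.
order_line = '1001,838,"Calabash, Water Spinach",2000-01-01'
user_key, original = split_by_kv(order_line, index=1)
product_field, _ = split_by_kv(order_line, index=2)
```

It keeps the same signature, so it can be passed to Map exactly as before, for example Map(split_by_kv, index=1, delimiter=","). With the sample data the plain split happens to find the right key because the quoted field comes after user_id, but it would return the wrong column if a quoted field containing a comma came before the key.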

The pipeline prints tuples of the form (key, {"orders": values for key, "users": values for key}):

('1887', {'orders': ['1000,1887,Cassava,2000-01-01'], 'users': []})
('838', {'orders': ['1001,838,"Calabash, Water Spinach",2000-01-01'], 'users': []})
('1', {'orders': [], 'users': ['1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13']})
('2', {'orders': [], 'users': ['2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06']})
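None of the four sample keys appears in both files, so every tuple has one empty list. If the goal is merged rows rather than grouped ones, the grouped tuples can be flattened afterwards. The following is a rough sketch of an inner join over one CoGroupByKey result; join_rows is a hypothetical helper name, and the matched element is made-up data (an order for user 1 does not appear in the question).

```python
def join_rows(element):
    """Yield one (key, order_line, user_line) tuple per matching pair."""
    key, grouped = element
    # Inner join: a key with no orders or no users yields nothing.
    for order in grouped["orders"]:
        for user in grouped["users"]:
            yield key, order, user


# Hypothetical element in which the same key exists on both sides.
matched = (
    "1",
    {
        "orders": ["1000,1,Cassava,2000-01-01"],
        "users": ["1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13"],
    },
)
# Element taken from the printed output: there is no user 1887.
unmatched = ("1887", {"orders": ["1000,1887,Cassava,2000-01-01"], "users": []})

matched_rows = list(join_rows(matched))
unmatched_rows = list(join_rows(unmatched))
```

In the pipeline this would sit between CoGroupByKey() and Map(print) as FlatMap(join_rows), since each grouped element can produce zero or more joined rows.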

Also, you may want to take a look at the new DataFrames API.
