

How to get multiple files as Apache Beam input?

I am working on this scenario: in Google Cloud Storage, my files are stored in this structure:

PS: the two files are in the same folder (the indentation was a mistake).

[image: folder structure]

What I want to do is:

1] read the 2 files "client_info.csv" and "client_events.csv" from each day

2] join them on a common column that appears in both files, to get 1 PCollection

3] apply transformations

4] load the data into BigQuery

I wrote code that reads from only 1 date, and it works well, but I couldn't solve the part that iterates over all dates.

If you have any suggestions, please share them.

One solution is a pipeline that merges two branches. Each branch reads one of the input files separately, and the two branches are then joined.

Please check out the illustration and the sample code available here.
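The two-branch idea can be sketched roughly as follows in the Python SDK. This is only a sketch under several assumptions, not the linked sample: each day is assumed to be a folder directly under the path passed as `bucket_glob` (for example `gs://my-bucket/*`, a made-up bucket), and the header lists and the join column (for example `client_id`) are hypothetical names that you would replace with your own. A wildcard in the file pattern covers all dates, `ReadFromTextWithFilename` keeps the file path so the date can be recovered from it, and `CoGroupByKey` joins the two branches on (date, join column).

```python
import csv
import posixpath


def date_of(path):
    """Return the name of the folder holding the file (assumed to be the date)."""
    return posixpath.basename(posixpath.dirname(path))


def to_keyed_row(element, header, key_column):
    """Turn a (path, csv_line) pair into ((date, key), row_dict)."""
    path, line = element
    row = dict(zip(header, next(csv.reader([line]))))
    return (date_of(path), row[key_column]), row


def merge_rows(element):
    """Emit one merged dict per info/event pair that shares a (date, key)."""
    (date, _), grouped = element
    for info in grouped["info"]:
        for event in grouped["events"]:
            yield {"date": date, **info, **event}


def build_join(pipeline, bucket_glob, info_header, events_header, key_column):
    """Build the two branches and join them; returns the joined PCollection."""
    # Imported here so the helpers above can be tested without Beam installed.
    import apache_beam as beam

    def read(label, filename, header):
        # One branch: read the same file name from every date folder.
        return (
            pipeline
            | f"Read {label}" >> beam.io.ReadFromTextWithFilename(
                f"{bucket_glob}/{filename}", skip_header_lines=1)
            | f"Key {label}" >> beam.Map(to_keyed_row, header, key_column)
        )

    info = read("info", "client_info.csv", info_header)
    events = read("events", "client_events.csv", events_header)
    return (
        {"info": info, "events": events}
        | "Join" >> beam.CoGroupByKey()
        | "Merge" >> beam.FlatMap(merge_rows)
    )
```

Note that `merge_rows` behaves like an inner join: a client with no events on a given day produces no output. The joined PCollection can then go through your transformations and be written out, for instance with `beam.io.WriteToBigQuery`.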
