
Java Apache Beam PCollections and how to make them work?

First, let me describe the scenario.

Step 1. I have to read a file line by line. The file is a .json file, and each line has the following format:

{
"schema":{Several keys that are to be deleted},
"payload":{"key1":20001,"key2":"aaaa","key3":"bbbb","key4":"USD","key5":"100"}
}

Step 2. Remove the schema object and end up with the following (more examples added for the sake of the next steps):

{"key1":20001,"key2":"aaaa","key3":"bbbb","key4":"USD","key5":"100"}
{"key1":20001,"key2":"aaaa","key3":"bbbb","key4":"US","key5":"90"}
{"key1":2002,"key2":"cccc","key3":"hhhh","key4":"CN","key5":"80"}

Step 3. Split each record into a key and a value: parse it as JSON in memory, then keep the key part and the value part as strings in a map:

{"key1":20001,"key2":"aaaa","key3":"bbbb"} = {"key4":"USD","key5":"100"}
{"key1":20001,"key2":"aaaa","key3":"bbbb"} = {"key4":"US","key5":"90"}
{"key1":2002,"key2":"cccc","key3":"hhhh"} = {"key4":"CN","key5":"80"}
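The split in Step 3 can be sketched in plain Java, without Beam or a JSON library. This is only an illustration under assumptions: the class name `PayloadSplitter` and the method `split` are hypothetical, the payload is assumed to be already parsed into a `Map`, and the key fields are taken from the example above.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Beam: splits one parsed payload into
// a "key" map and a "value" map.
final class PayloadSplitter {
    // Assumption: these are the fields that form the grouping key (from Step 3).
    static final List<String> KEY_FIELDS = Arrays.asList("key1", "key2", "key3");

    static Map.Entry<Map<String, Object>, Map<String, Object>> split(Map<String, Object> payload) {
        Map<String, Object> keyPart = new LinkedHashMap<>();
        Map<String, Object> valuePart = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : payload.entrySet()) {
            // Field names are compared case-sensitively.
            if (KEY_FIELDS.contains(e.getKey())) {
                keyPart.put(e.getKey(), e.getValue());
            } else {
                valuePart.put(e.getKey(), e.getValue());
            }
        }
        return new AbstractMap.SimpleImmutableEntry<>(keyPart, valuePart);
    }

    public static void main(String[] args) {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("key1", 20001);
        payload.put("key2", "aaaa");
        payload.put("key3", "bbbb");
        payload.put("key4", "USD");
        payload.put("key5", "100");
        Map.Entry<Map<String, Object>, Map<String, Object>> kv = split(payload);
        System.out.println(kv.getKey() + " = " + kv.getValue());
    }
}
```

In a pipeline, both halves would probably be serialized back to JSON strings so that the key can be compared and encoded as a plain `String`.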

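Step 4, which I cannot complete because I don't understand PCollections well enough. I need to take all the lines that were read and apply a GroupByKey so that the result looks like: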

{"key1":20001,"key2":"aaaa","key3":"bbbb"} = [ 
                                        {"key4":"USD","key5":"100"},
                                        {"key4":"US","key5":"90"}    ]
{"key1":2002,"key2":"cccc","key3":"hhhh"} = {"key4":"CN","key5":"80"}
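To make the target of Step 4 concrete, here is an in-memory model of what a group-by-key does to a list of key/value pairs. It is not Beam code: the class `GroupByKeySketch` and its `group` method are hypothetical names for this sketch. Beam's `GroupByKey` performs the same grouping across workers and hands each key an `Iterable` of its values, with no guarantee about their order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of grouping by key, for illustration only.
final class GroupByKeySketch {
    static Map<String, List<String>> group(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            // Create the value list on first sight of a key, then append.
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        String keyA = "{\"key1\":20001,\"key2\":\"aaaa\",\"key3\":\"bbbb\"}";
        String keyB = "{\"key1\":2002,\"key2\":\"cccc\",\"key3\":\"hhhh\"}";
        List<Map.Entry<String, String>> pairs = Arrays.asList(
                new AbstractMap.SimpleEntry<>(keyA, "{\"key4\":\"USD\",\"key5\":\"100\"}"),
                new AbstractMap.SimpleEntry<>(keyA, "{\"key4\":\"US\",\"key5\":\"90\"}"),
                new AbstractMap.SimpleEntry<>(keyB, "{\"key4\":\"CN\",\"key5\":\"80\"}"));
        group(pairs).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Note that a key with a single record still ends up with a one-element collection rather than a bare value.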

Right now my code looks like this:

static void runSimplePipeline(PipelineOptionsCustom options) {
            Pipeline p = Pipeline.create(options);

            p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
                .apply("TransformData", ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) { 
                        Gson gson = new GsonBuilder().create();
                        ObjectMapper oMapper = new ObjectMapper();
                        JSONObject obj_key = new JSONObject();
                        JSONObject obj_value = new JSONObject();
                        // Names must match the payload fields exactly; contains() is case-sensitive.
                        List<String> listMainKeys = Arrays.asList("key1", "key2", "key3");


                        HashMap<String, Object> parsedMap = gson.fromJson(c.element().toString(), HashMap.class);
                        parsedMap.remove("schema");

                        Map<String, String> map = oMapper.convertValue(parsedMap.get("payload"), Map.class);
                        for (Map.Entry<String,String> entry : map.entrySet()) {
                            if (listMainKeys.contains(entry.getKey())) {
                                obj_key.put(entry.getKey(),entry.getValue());
                            } else {
                                obj_value.put(entry.getKey(),entry.getValue());
                            }

                        }
                        KV objectKV = KV.of(obj_key.toJSONString(), obj_value.toJSONString());

                        System.out.print(obj_key.toString() + " : " + obj_value.toString() +"\n");

                    }
                })); // <------- RIGHT HERE

            p.run().waitUntilFinish();
          }

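The obvious part is that at the spot marked "RIGHT HERE" I should add another apply with CountByKey, but that requires a full PCollection, and that is the part I don't really understand.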

Here is the code, thanks to the GitHub link from Guillem Xercavins:

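import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
// Assumption: JSONObject comes from json-simple, since toJSONString() is called on it.
import org.json.simple.JSONObject;

static void runSimplePipeline(PipelineOptionsCustom options) {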
    Pipeline p = Pipeline.create(options);

    PCollection<Void> results = p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
            .apply("TransformData", ParDo.of(new DoFn<String, KV<String, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    Gson gson = new GsonBuilder().create();
                    ObjectMapper oMapper = new ObjectMapper();
                    JSONObject obj_key = new JSONObject();
                    JSONObject obj_value = new JSONObject();
                    List<String> listMainKeys = Arrays
                            .asList(new String[] { "EBELN", "AEDAT", "BATXT", "EKOTX", "Land1", "WAERS" });

                    HashMap<String, Object> parsedMap = gson.fromJson(c.element().toString(), HashMap.class);
                    parsedMap.remove("schema");

                    Map<String, String> map = oMapper.convertValue(parsedMap.get("payload"), Map.class);
                    for (Map.Entry<String, String> entry : map.entrySet()) {
                        if (listMainKeys.contains(entry.getKey())) {
                            obj_key.put(entry.getKey(), entry.getValue());
                        } else {
                            obj_value.put(entry.getKey(), entry.getValue());
                        }

                    }
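                    // Emit a typed pair so the next transform receives a PCollection<KV<String, String>>.
                    KV<String, String> objectKV = KV.of(obj_key.toJSONString(), obj_value.toJSONString());
                    c.output(objectKV);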

                }
            })).apply("Group By Key", GroupByKey.<String, String>create())
            .apply("Continue Processing", ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    System.out.print(c.element());
                }
            }));

    p.run().waitUntilFinish();
}
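After the grouping step, each element is a `KV<String, Iterable<String>>`, so inside the last `DoFn` the key and its values are available through `c.element().getKey()` and `c.element().getValue()`. As a rough sketch of what "Continue Processing" might do with them, the plain-Java helper below renders one group in the layout from Step 4. The class `GroupedFormatter` and its `format` method are hypothetical and not part of Beam.

```java
import java.util.Arrays;

// Hypothetical helper: renders one grouped element as "key = [v1, v2, ...]".
final class GroupedFormatter {
    static String format(String key, Iterable<String> values) {
        // String.join accepts any Iterable of CharSequence, so no copy is needed.
        return key + " = [" + String.join(", ", values) + "]";
    }

    public static void main(String[] args) {
        System.out.println(format(
                "{\"key1\":20001,\"key2\":\"aaaa\",\"key3\":\"bbbb\"}",
                Arrays.asList(
                        "{\"key4\":\"USD\",\"key5\":\"100\"}",
                        "{\"key4\":\"US\",\"key5\":\"90\"}")));
    }
}
```

Inside the pipeline this could be called as `GroupedFormatter.format(c.element().getKey(), c.element().getValue())`, assuming the helper is on the classpath of the workers.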
