How to use ParseJsons in Apache Beam / Google Dataflow?

Java newbie here. I'm struggling to understand how to use ParseJsons in my Apache Beam pipeline to parse a PCollection of strings into a PCollection of objects.

My understanding is that I first need to define a class that matches the JSON structure, and then use ParseJsons to map the JSON strings into objects of that class.

However, the ParseJsons documentation looks cryptic to me, and I'm not sure how to actually apply the transform in Apache Beam. Could someone give me a quick and dirty example of how to parse line-delimited JSON strings?

Here's one of the attempts I've made, but unfortunately the syntax is incorrect.

class Product {
  private String name = null;
  private String url = null;
}

p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
 .apply(new ParseJsons.of(Product))
 .apply("WriteCounts", TextIO.write().to(options.getOutput()));

I think you want:

// Product must implement java.io.Serializable for SerializableCoder to accept it.
PCollection<Product> products =
  p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
   .apply(ParseJsons.of(Product.class))
   .setCoder(SerializableCoder.of(Product.class));
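For the SerializableCoder call above to work, the target class has to implement java.io.Serializable. Jackson, which ParseJsons uses under the hood, also generally expects a no-argument constructor and accessors for private fields, so the Product class from the question would probably need both. A minimal sketch of such a class follows. The round-trip helper ProductSerializationCheck is a hypothetical name used only to illustrate that the class survives plain Java serialization, which is what SerializableCoder relies on.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// POJO matching the JSON structure {"name": ..., "url": ...} from the question.
class Product implements Serializable {
  private static final long serialVersionUID = 1L;

  private String name;
  private String url;

  // No-argument constructor so a JSON mapper can instantiate the class.
  public Product() {}

  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
  public String getUrl() { return url; }
  public void setUrl(String url) { this.url = url; }
}

// Hypothetical helper (not part of Beam): copies a Product through Java
// serialization to confirm the class is usable with a serialization-based coder.
class ProductSerializationCheck {
  static Product roundTrip(Product original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(original);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (Product) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Product could not be serialized", e);
    }
  }
}
```

If roundTrip throws, the class (or one of its fields) is not serializable, and the pipeline would likely fail in the same way when it encodes elements.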

The ParseJsons.of method is static, so you can call it directly without instantiating the class. However, TextIO.write() only accepts strings, so you will need to convert the result back to String before writing. Example:

// FormatPojoFn is a DoFn<MyPojo, String> that you write yourself.
// The chain ends in a write, so there is no PCollection to assign.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
 .apply("Parse JSON", ParseJsons.of(MyPojo.class))
 .apply("Convert back to String", ParDo.of(new FormatPojoFn()))
 .apply("WriteCounts", TextIO.write().to(options.getOutput()));
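FormatPojoFn is not defined in the snippet above, so here is a rough sketch of the formatting logic it might wrap. It assumes MyPojo carries the same name and url fields as Product in the question. The PojoFormatter helper and the comma-separated output format are illustrative choices, not anything Beam prescribes. The code is kept free of Beam imports so it can be tried on its own.

```java
import java.io.Serializable;

// Hypothetical POJO: assumed to have the same two fields as Product in the question.
class MyPojo implements Serializable {
  private static final long serialVersionUID = 1L;

  private String name;
  private String url;

  public MyPojo() {}

  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
  public String getUrl() { return url; }
  public void setUrl(String url) { this.url = url; }
}

// Hypothetical helper holding the per-element formatting logic.
// Inside a Beam DoFn<MyPojo, String>, the @ProcessElement method would
// emit PojoFormatter.toLine(element) for each input element.
class PojoFormatter {
  static String toLine(MyPojo pojo) {
    // Treat missing fields as empty strings so the output never contains "null".
    String name = pojo.getName() == null ? "" : pojo.getName();
    String url = pojo.getUrl() == null ? "" : pojo.getUrl();
    return name + "," + url;
  }
}
```

Keeping the formatting in a plain static method like this makes it easy to unit test without running a pipeline.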

You could also try using the writeCustomType method on the TextIO class:

p.apply(TextIO.<UserEvent>writeCustomType(new FormatEvent()).to(...));
