
How to read a JSON file using the Apache Beam ParDo function in Java

I am new to Apache Beam. As per our requirement, I need to pass a JSON file containing five to ten JSON records as input, read this JSON data from the file line by line, and store it into BigQuery. Can anyone please help me with my sample code below, which tries to read JSON data using Apache Beam:

PCollection<String> lines = 
    pipeline
      .apply("ReadMyFile", 
             TextIO.read()
                   .from("C:\\Users\\Desktop\\test.json")); 
if(null!=lines) { 
  PCollection<String> words =
     lines.apply(ParDo.of(new DoFn<String, String>() { 
        @ProcessElement
        public void processElement(ProcessContext c) { 
          String line = c.element();
        }
      })); 
  pipeline.run(); 
}

The answer is it depends.

TextIO reads the file line by line, so in your test.json each line needs to contain a separate Json object.

The ParDo you have will then receive those lines one by one, i.e. each call to @ProcessElement gets a single line.

Then in your ParDo you can use something like Jackson's ObjectMapper to parse the Json from the line (or any other Json parser you're familiar with, but Jackson is widely used, including in a few places in Beam itself).

Overall the approach to writing a ParDo is this (a sketch using Jackson follows the list):

  • get the c.element();
  • do something with the value of c.element(), e.g. parse it from Json into a Java object;
  • send the result of what you did with c.element() to c.output().
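
A minimal sketch of that pattern, assuming a hypothetical MyRecord POJO whose fields match your Json (you may also need to register a coder for it, e.g. SerializableCoder, so Beam can encode the output PCollection):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

PCollection<MyRecord> records =
    lines.apply("ParseJson", ParDo.of(new DoFn<String, MyRecord>() {
      // ObjectMapper is not serializable, so keep it transient and
      // create it once per DoFn instance in @Setup.
      private transient ObjectMapper mapper;

      @Setup
      public void setup() {
        mapper = new ObjectMapper();
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // 1. get the element (one line of the input file)
        String line = c.element();
        // 2. parse the Json line into a Java object
        MyRecord record = mapper.readValue(line, MyRecord.class);
        // 3. emit the parsed object downstream
        c.output(record);
      }
    }));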

I would recommend starting by looking at the Jackson extension to the Beam SDK; it adds PTransforms (ParseJsons and AsJsons) that do exactly that.
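
For illustration, a minimal sketch of using that extension (assuming the beam-sdks-java-extensions-jackson dependency is on the classpath and MyRecord is a hypothetical POJO that implements Serializable):

import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.extensions.jackson.ParseJsons;
import org.apache.beam.sdk.values.PCollection;

// ParseJsons uses Jackson under the hood to map each Json string to MyRecord.
PCollection<MyRecord> records =
    lines
        .apply("ParseJson", ParseJsons.of(MyRecord.class))
        // The transform cannot infer a coder for MyRecord, so set one explicitly.
        .setCoder(SerializableCoder.of(MyRecord.class));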

Please also take a look at this post; it has some links.

There's also the JsonToRow transform that you can look at for similar logic; the difference is that it doesn't parse the Json into a user-defined Java object but into a Beam Row class instead.
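
A minimal sketch of that alternative, assuming a Json layout with a string field col1 and a double field col2 (as in the sample records shown further down):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.JsonToRow;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Describe the expected Json fields as a Beam Schema.
Schema schema = Schema.builder()
    .addStringField("col1")
    .addDoubleField("col2")
    .build();

// Each input line is parsed into a schema-aware Beam Row.
PCollection<Row> rows = lines.apply("JsonToRow", JsonToRow.withSchema(schema));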

Before writing to BQ you need to convert the objects you parsed from Json into BQ rows, which will be another ParDo after your parsing logic, and then actually apply the BigQueryIO as yet another step. You can see a few examples in the BigQueryIO tests.

Let's assume that we have Json strings in the file as below:

{"col1":"sample-val-1", "col2":1.0}
{"col1":"sample-val-2", "col2":2.0}
{"col1":"sample-val-3", "col2":3.0}
{"col1":"sample-val-4", "col2":4.0}
{"col1":"sample-val-5", "col2":5.0}

In order to store these values from the file into BigQuery through Dataflow/Beam, you might have to follow the steps below:

  • Define a TableReference to refer the BigQuery table.

  • Define TableFieldSchema for every column you expect to store.

  • Read the file using TextIO.read().

  • Create a DoFn to parse each Json string into TableRow format.

  • Commit the TableRow objects using BigQueryIO.

You may refer to the code snippets below regarding the above steps:

  • For TableReference and TableFieldSchema creation,

     TableReference tableRef = new TableReference();
     tableRef.setProjectId("project-id");
     tableRef.setDatasetId("dataset-name");
     tableRef.setTableId("table-name");

     List<TableFieldSchema> fieldDefs = new ArrayList<>();
     fieldDefs.add(new TableFieldSchema().setName("column1").setType("STRING"));
     fieldDefs.add(new TableFieldSchema().setName("column2").setType("FLOAT"));
  • For the Pipeline steps,

     // Build and run the pipeline: read the file, map each Json line to a
     // TableRow, then write the rows to BigQuery.
     Pipeline pipeLine = Pipeline.create(options);
     pipeLine
         .apply("ReadMyFile", TextIO.read().from("path-to-json-file"))
         .apply("MapToTableRow", ParDo.of(new DoFn<String, TableRow>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             // Parse the Json line into a map with Gson.
             Gson gson = new GsonBuilder().create();
             HashMap<String, Object> parsedMap =
                 gson.fromJson(c.element(), HashMap.class);

             // Copy the parsed values into a BigQuery TableRow.
             TableRow row = new TableRow();
             row.set("column1", parsedMap.get("col1").toString());
             row.set("column2", Double.parseDouble(parsedMap.get("col2").toString()));
             c.output(row);
           }
         }))
         .apply("CommitToBQTable",
             BigQueryIO.writeTableRows()
                 .to(tableRef)
                 .withSchema(new TableSchema().setFields(fieldDefs))
                 .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                 .withWriteDisposition(WriteDisposition.WRITE_APPEND));
     pipeLine.run();

The BigQuery table might look as below:

[Screenshot of the resulting BigQuery table]
