
How to read a JSON file using the Apache Beam ParDo function in Java

I am new to Apache Beam. As per our requirement, I need to pass a JSON file containing five to ten JSON records as input, read this JSON data from the file line by line, and store it in BigQuery. Can anyone please help me with my sample code below, which tries to read JSON data using Apache Beam:

PCollection<String> lines = 
    pipeline
      .apply("ReadMyFile", 
             TextIO.read()
                   .from("C:\\Users\\Desktop\\test.json")); 
if(null!=lines) { 
  PCollection<String> words =
     lines.apply(ParDo.of(new DoFn<String, String>() { 
        @ProcessElement
        public void processElement(ProcessContext c) { 
          String line = c.element();
        }
      })); 
  pipeline.run(); 
}

The answer is: it depends.

TextIO reads the files line by line, so each line in your test.json needs to contain a separate JSON object.

The ParDo you have will then receive those lines one by one, i.e. each call to @ProcessElement gets a single line.

Then, in your ParDo, you can use something like Jackson's ObjectMapper to parse the JSON from the line (or any other JSON parser you're familiar with, but Jackson is widely used, including in a few places in Beam itself).

Overall, the approach to writing such a ParDo is this (a minimal sketch follows this list):

  • get c.element();
  • do something with the value of c.element(), e.g. parse it from JSON into a Java object;
  • send the result of what you did with c.element() to c.output();
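
For example, here is a minimal sketch of such a DoFn using Jackson's ObjectMapper. It is not code from the question: MyRecord is a hypothetical POJO whose field names are assumed to match the JSON keys, and lines is the PCollection<String> produced by TextIO in the question.

import java.io.Serializable;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical POJO matching the JSON keys in the file; adjust the fields to your data.
@DefaultCoder(SerializableCoder.class)
public class MyRecord implements Serializable {
  public String col1;
  public double col2;
}

// Parse each JSON line into a MyRecord, following the element -> parse -> output pattern.
PCollection<MyRecord> records =
    lines.apply("ParseJsonLines", ParDo.of(new DoFn<String, MyRecord>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // 1. get the element (one JSON object per line)
        String line = c.element();
        // 2. parse it from JSON into a Java object
        //    (in real code you would typically create the ObjectMapper once, e.g. in @Setup)
        MyRecord record = new ObjectMapper().readValue(line, MyRecord.class);
        // 3. send the result downstream
        c.output(record);
      }
    }));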

I would recommend starting by looking at the Jackson extension to the Beam SDK; it adds PTransforms to do exactly that, see this and this.
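
As a rough illustration of that extension (assuming the beam-sdks-java-extensions-jackson dependency is on the classpath, and reusing the hypothetical MyRecord POJO from the sketch above), parsing can be reduced to a single transform:

import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.extensions.jackson.ParseJsons;
import org.apache.beam.sdk.values.PCollection;

// ParseJsons uses Jackson under the hood; the output coder has to be set explicitly.
PCollection<MyRecord> records =
    lines.apply("ParseJsons", ParseJsons.of(MyRecord.class))
         .setCoder(SerializableCoder.of(MyRecord.class));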

Please also take a look at this post; it has some relevant links.

There's also the JsonToRow transform that you can look at for similar logic; the difference is that it doesn't parse the JSON into a user-defined Java object but into a Beam Row class instead.
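
A short sketch of that approach, where the schema field names (col1, col2) are assumptions based on the sample JSON further below: you describe the expected fields with a Beam Schema and get back a PCollection<Row>.

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.JsonToRow;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Describe the expected JSON fields as a Beam Schema.
Schema jsonSchema = Schema.builder()
    .addStringField("col1")
    .addDoubleField("col2")
    .build();

// Each input line is parsed into a schema-aware Row rather than a custom POJO.
PCollection<Row> rows = lines.apply("JsonToRow", JsonToRow.withSchema(jsonSchema));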

Before writing to BigQuery you need to convert the objects you parsed from JSON into BigQuery rows; this will be another ParDo after your parsing logic, and then you apply BigQueryIO as yet another step. You can see a few examples in the BQ test.

Let's assume that we have JSON strings in the file as below:

{"col1":"sample-val-1", "col2":1.0}
{"col1":"sample-val-2", "col2":2.0}
{"col1":"sample-val-3", "col2":3.0}
{"col1":"sample-val-4", "col2":4.0}
{"col1":"sample-val-5", "col2":5.0}

In order to store these values from the file in BigQuery through Dataflow/Beam, you might have to follow the steps below:

  • Define a TableReference to refer to the BigQuery table.

  • Define a TableFieldSchema for every column you expect to store.

  • Read the file using TextIO.read().

  • Create a DoFn to parse the JSON strings into TableRow format.

  • Commit the TableRow objects using BigQueryIO.

You may refer to the code snippets below for the above steps.

  • For the TableReference and TableFieldSchema creation:

     TableReference tableRef = new TableReference();
     tableRef.setProjectId("project-id");
     tableRef.setDatasetId("dataset-name");
     tableRef.setTableId("table-name");

     List<TableFieldSchema> fieldDefs = new ArrayList<>();
     fieldDefs.add(new TableFieldSchema().setName("column1").setType("STRING"));
     fieldDefs.add(new TableFieldSchema().setName("column2").setType("FLOAT"));
  • For the pipeline steps:

     Pipeline pipeLine = Pipeline.create(options);
     pipeLine
         .apply("ReadMyFile", TextIO.read().from("path-to-json-file"))
         .apply("MapToTableRow", ParDo.of(new DoFn<String, TableRow>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             // Parse the JSON line into a map of column values.
             Gson gson = new GsonBuilder().create();
             HashMap<String, Object> parsedMap =
                 gson.fromJson(c.element().toString(), HashMap.class);

             // Map the parsed values onto the BigQuery columns.
             TableRow row = new TableRow();
             row.set("column1", parsedMap.get("col1").toString());
             row.set("column2", Double.parseDouble(parsedMap.get("col2").toString()));
             c.output(row);
           }
         }))
         .apply("CommitToBQTable", BigQueryIO.writeTableRows()
             .to(tableRef)
             .withSchema(new TableSchema().setFields(fieldDefs))
             .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(WriteDisposition.WRITE_APPEND));
     pipeLine.run();

The BigQuery table might then look as below:

  column1         column2
  sample-val-1    1.0
  sample-val-2    2.0
  sample-val-3    3.0
  sample-val-4    4.0
  sample-val-5    5.0
