
How to read a JSON file using the Apache Beam ParDo function in Java

I am new to Apache Beam. As per our requirement, I need to pass a JSON file containing five to ten JSON records as input, read this JSON data from the file line by line, and store it in BigQuery. Can anyone please help me with my sample code below, which tries to read JSON data using Apache Beam:

PCollection<String> lines = 
    pipeline
      .apply("ReadMyFile", 
             TextIO.read()
                   .from("C:\\Users\\Desktop\\test.json")); 
if(null!=lines) { 
  PCollection<String> words =
     lines.apply(ParDo.of(new DoFn<String, String>() { 
        @ProcessElement
        public void processElement(ProcessContext c) { 
          String line = c.element();
        }
      })); 
  pipeline.run(); 
}

The answer is: it depends.

TextIO reads the files line by line, so each line in your test.json needs to contain a separate JSON object.

The ParDo you have will then receive those lines one by one, i.e. each call to @ProcessElement gets a single line.

Then, in your ParDo, you can use something like Jackson's ObjectMapper to parse the JSON from the line (or any other JSON parser you're familiar with, but Jackson is widely used, including in a few places in Beam itself).

Overall, the approach to writing such a ParDo is this (a minimal sketch follows this list):

  • get c.element();
  • do something with the value of c.element(), e.g. parse it from JSON into a Java object;
  • send the result of what you did with c.element() to c.output();
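
For example, here is a minimal sketch of such a DoFn using Jackson's ObjectMapper. It is not code from the question: MyRecord is a hypothetical POJO whose field names are assumed to match the JSON keys, and lines is the PCollection<String> produced by TextIO in the question.

import java.io.Serializable;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical POJO matching the JSON keys in the file; adjust the fields to your data.
@DefaultCoder(SerializableCoder.class)
public class MyRecord implements Serializable {
  public String col1;
  public double col2;
}

// Parse each JSON line into a MyRecord, following the element -> parse -> output pattern.
PCollection<MyRecord> records =
    lines.apply("ParseJsonLines", ParDo.of(new DoFn<String, MyRecord>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // 1. get the element (one JSON object per line)
        String line = c.element();
        // 2. parse it from JSON into a Java object
        //    (in real code you would typically create the ObjectMapper once, e.g. in @Setup)
        MyRecord record = new ObjectMapper().readValue(line, MyRecord.class);
        // 3. send the result downstream
        c.output(record);
      }
    }));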

I would recommend starting by looking at the Jackson extension to the Beam SDK; it adds PTransforms to do exactly that, see this and this.
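
As a rough illustration of that extension (assuming the beam-sdks-java-extensions-jackson dependency is on the classpath, and reusing the hypothetical MyRecord POJO from the sketch above), parsing can be reduced to a single transform:

import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.extensions.jackson.ParseJsons;
import org.apache.beam.sdk.values.PCollection;

// ParseJsons uses Jackson under the hood; the output coder has to be set explicitly.
PCollection<MyRecord> records =
    lines.apply("ParseJsons", ParseJsons.of(MyRecord.class))
         .setCoder(SerializableCoder.of(MyRecord.class));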

Please also take a look at this post; it has some relevant links.

There's also the JsonToRow transform that you can look at for similar logic; the difference is that it doesn't parse the JSON into a user-defined Java object but into a Beam Row class instead.
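
A short sketch of that approach, where the schema field names (col1, col2) are assumptions based on the sample JSON further below: you describe the expected fields with a Beam Schema and get back a PCollection<Row>.

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.JsonToRow;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Describe the expected JSON fields as a Beam Schema.
Schema jsonSchema = Schema.builder()
    .addStringField("col1")
    .addDoubleField("col2")
    .build();

// Each input line is parsed into a schema-aware Row rather than a custom POJO.
PCollection<Row> rows = lines.apply("JsonToRow", JsonToRow.withSchema(jsonSchema));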

Before writing to BigQuery you need to convert the objects you parsed from JSON into BigQuery rows; this will be another ParDo after your parsing logic, and then you apply BigQueryIO as yet another step. You can see a few examples in the BQ test.

Let's assume that we have JSON strings in the file as below:

{"col1":"sample-val-1", "col2":1.0}
{"col1":"sample-val-2", "col2":2.0}
{"col1":"sample-val-3", "col2":3.0}
{"col1":"sample-val-4", "col2":4.0}
{"col1":"sample-val-5", "col2":5.0}

In order to store these values from the file in BigQuery through Dataflow/Beam, you might have to follow the steps below:

  • Define a TableReference to refer to the BigQuery table.

  • Define a TableFieldSchema for every column you expect to store.

  • Read the file using TextIO.read().

  • Create a DoFn to parse the JSON strings into TableRow format.

  • Commit the TableRow objects using BigQueryIO.

You may refer to the code snippets below for the above steps.

  • For the TableReference and TableFieldSchema creation:

     TableReference tableRef = new TableReference();
     tableRef.setProjectId("project-id");
     tableRef.setDatasetId("dataset-name");
     tableRef.setTableId("table-name");

     List<TableFieldSchema> fieldDefs = new ArrayList<>();
     fieldDefs.add(new TableFieldSchema().setName("column1").setType("STRING"));
     fieldDefs.add(new TableFieldSchema().setName("column2").setType("FLOAT"));
  • For the pipeline steps:

     Pipeline pipeLine = Pipeline.create(options);
     pipeLine
         .apply("ReadMyFile", TextIO.read().from("path-to-json-file"))
         .apply("MapToTableRow", ParDo.of(new DoFn<String, TableRow>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             // Parse the JSON line into a map of column values.
             Gson gson = new GsonBuilder().create();
             HashMap<String, Object> parsedMap =
                 gson.fromJson(c.element().toString(), HashMap.class);

             // Map the parsed values onto the BigQuery columns.
             TableRow row = new TableRow();
             row.set("column1", parsedMap.get("col1").toString());
             row.set("column2", Double.parseDouble(parsedMap.get("col2").toString()));
             c.output(row);
           }
         }))
         .apply("CommitToBQTable", BigQueryIO.writeTableRows()
             .to(tableRef)
             .withSchema(new TableSchema().setFields(fieldDefs))
             .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(WriteDisposition.WRITE_APPEND));
     pipeLine.run();

The BigQuery table might then look as below:

  column1         column2
  sample-val-1    1.0
  sample-val-2    2.0
  sample-val-3    3.0
  sample-val-4    4.0
  sample-val-5    5.0
