
Transform a large JSONL file with unknown JSON properties into CSV using Apache Beam, Google Dataflow and Java

Here is my scenario:

  1. A large JSONL file is in Google Cloud Storage.
  2. The JSON properties are unknown, so an Apache Beam Schema cannot be defined in the pipeline.
  3. Use Apache Beam, Google Dataflow and Java to convert the JSONL to CSV.
  4. Once the transformation is done, store the CSV in Google Cloud Storage (the same bucket where the JSONL is stored).
  5. Notify by some means, e.g. transformation_done=true if possible (REST API or event).
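For step 5, one common convention (an assumption on my part, not something the question prescribes) is to write a small sentinel/marker object next to the output once the pipeline finishes; a GCS-triggered Cloud Function or a simple poller can then fire the notification. A minimal local-filesystem sketch of that idea:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DoneMarker {
    // Writes a sentinel file next to the CSV output so a watcher (for
    // example a GCS-triggered Cloud Function) can detect completion.
    // This sketch uses a local directory; with GCS you would instead
    // write an object such as gs://bucket/output/_SUCCESS after the
    // pipeline's writes have finished.
    public static Path writeMarker(Path outputDir) {
        try {
            Path marker = outputDir.resolve("_SUCCESS");
            Files.write(marker, "transformation_done=true".getBytes(StandardCharsets.UTF_8));
            return marker;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path marker = writeMarker(Files.createTempDirectory("csv-out"));
        System.out.println(Files.readAllLines(marker).get(0)); // transformation_done=true
    }
}
```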

Any help or guidance would be helpful, as I am new to Apache Beam, though I am reading the Apache Beam documentation.

I have edited the question with example JSONL data:

{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"}

The JSON keys are present in the input file, but they are not known at transformation time. To explain with an example: suppose I have three clients, each with its own Google Storage bucket, and each uploads its own JSONL file with different JSON properties.
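Since the keys differ per client but each file is a flat JSON object per line, the header can in principle be derived from the keys of the first line. As a rough illustration, here is a regex-based toy extractor that only handles flat, string-valued objects like the samples in this question; a real pipeline should use a JSON library (e.g. Gson or Jackson) instead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonKeys {
    // Toy key extraction for flat JSON objects whose values are plain
    // strings, like the client samples above. Keys are the quoted
    // strings immediately followed by a colon.
    public static List<String> keys(String jsonLine) {
        List<String> keys = new ArrayList<>();
        Matcher m = Pattern.compile("\"([^\"]+)\"\\s*:").matcher(jsonLine);
        while (m.find()) keys.add(m.group(1));
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(keys("{\"city\":\"Mumbai\", \"pincode\":\"2012A\"}"));   // [city, pincode]
        System.out.println(keys("{\"Relation\":\"Finance\", \"Code\":\"2012A\"}")); // [Relation, Code]
    }
}
```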

Client 1: Input JSONL file

{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}

Client 2: Input JSONL file

{"Relation":"Finance", "Code":"2012A"}
{"Relation":"Production", "Code":"20XXX"}

Client 3: Input JSONL file

{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}

Question: How could I write a generic Beam pipeline that transforms all three as shown below?

Client 1: Output CSV file

["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]

Client 2: Output CSV file

["Relation", "Code"]
["Finance", "2012A"]
["Production","20XXX"]

Client 3: Output CSV file

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]

Edit: Removed the previous answer, as the question has been modified with examples.

There is no generic, out-of-the-box way to achieve such a result. You have to write the logic yourself, depending on your requirements and how you are handling the pipeline.

Below are some examples, but you need to verify them for your case, as I have only tried them on a small JSONL file.

TextIO


Approach 1
If you can collect the header value of the output CSV beforehand, it will be much easier. But getting the header beforehand is itself another challenge.

 //pipeline
 pipeline.apply("ReadJSONLines", TextIO.read().from("FILE URL"))
         .apply(ParDo.of(new DoFn<String, String>() {
             @ProcessElement
             public void processLines(@Element String line, OutputReceiver<String> receiver) {
                 String values = getCsvLine(line, false);
                 receiver.output(values);
             }
         }))
         .apply("WriteCSV", TextIO.write().to("FileName")
                 .withSuffix(".csv")
                 .withoutSharding()
                 .withDelimiter(new char[] { '\r', '\n' })
                 .withHeader(getHeader()));
 private static String getHeader() {
     String header = "";
     // your logic to get the header line.
     return header;
 }
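One possible way to fill in getHeader() (an assumption on my part, not part of the original answer) is to read just the first line of the JSONL input before constructing the pipeline and join its keys. Sketched here against a plain Reader, so the same logic works for a local file or a GCS channel (e.g. Channels.newReader over the storage client's ReadableByteChannel); the regex key extraction is a toy stand-in that only handles flat objects:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstLineHeader {
    // Hypothetical getHeader() implementation: read only the first line
    // of the JSONL input and join its keys with commas.
    public static String getHeader(BufferedReader jsonl) {
        try {
            String first = jsonl.readLine();
            List<String> keys = new ArrayList<>();
            // Toy key extraction for flat objects; use Gson in real code.
            Matcher m = Pattern.compile("\"([^\"]+)\"\\s*:").matcher(first);
            while (m.find()) keys.add(m.group(1));
            return String.join(",", keys);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String data = "{\"Name\":\"Gilbert\", \"Session\":\"2013\"}\n"
                    + "{\"Name\":\"Alexa\", \"Session\":\"2013\"}\n";
        System.out.println(getHeader(new BufferedReader(new StringReader(data)))); // Name,Session
    }
}
```

Note this means the input is read twice (once for the header, once by the pipeline), which is usually acceptable since only one line is fetched up front.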

Probable ways to get the header line (these are only assumptions and may not work in your case):

  • You can have a text file in GCS which stores the header of a particular JSON file. In your logic you can then fetch the header by reading that file; check this SO thread about how to read files from GCS.
  • You can try to pass the header as a runtime argument, but that depends on how you are configuring and executing your pipeline.

Approach 2
This is a workaround I found for small JSON files (~10k lines). The example below may not work for large files.

 final int[] count = { 0 };
 pipeline.apply(/* read file */)
         .apply(ParDo.of(new DoFn<String, String>() {
             @ProcessElement
             public void processLines(@Element String line, OutputReceiver<String> receiver) {
                 // check if it is the first processed element; if yes, emit the header first
                 if (count[0] == 0) {
                     String header = getCsvLine(line, true);
                     receiver.output(header);
                     count[0]++;
                 }
                 String values = getCsvLine(line, false);
                 receiver.output(values);
             }
         }))
         .apply(/* write file */);
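The count[] trick above, distilled outside of Beam: emit the header once before the first record, then only values. Note this is only reliable when all elements pass through a single worker in order (hence the small-file caveat). The csvLine helper below is a hypothetical regex-based stand-in for getCsvLine that only handles flat, string-valued objects:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderOncePerStream {
    // Emit the header before the first record, then only values,
    // mirroring the count[] check in the DoFn above.
    public static List<String> convert(List<String> jsonLines) {
        List<String> out = new ArrayList<>();
        for (String line : jsonLines) {
            if (out.isEmpty()) out.add(csvLine(line, true));
            out.add(csvLine(line, false));
        }
        return out;
    }

    // Toy stand-in for getCsvLine(): collects keys or values of a flat,
    // string-valued JSON object and joins them with commas.
    static String csvLine(String line, boolean header) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"").matcher(line);
        while (m.find()) parts.add(header ? m.group(1) : m.group(2));
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        List<String> out = convert(Arrays.asList(
                "{\"city\":\"Mumbai\", \"pincode\":\"2012A\"}",
                "{\"city\":\"Delhi\", \"pincode\":\"2012N\"}"));
        out.forEach(System.out::println);
        // city,pincode
        // Mumbai,2012A
        // Delhi,2012N
    }
}
```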

FileIO


As mentioned by Saransh in the comments, with FileIO all you have to do is read the JSONL line by line manually and then convert each line into comma-separated format. E.g.:

 pipeline.apply(FileIO.match().filepattern("FILE PATH"))
         .apply(FileIO.readMatches())
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((FileIO.ReadableFile f) -> {
                     List<String> output = new ArrayList<>();
                     try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
                         String line = br.readLine();
                         while (line != null) {
                             if (output.size() == 0) {
                                 String header = getCsvLine(line, true);
                                 output.add(header);
                             }
                             String result = getCsvLine(line, false);
                             output.add(result);
                             line = br.readLine();
                         }
                     } catch (IOException e) {
                         throw new RuntimeException("Error while reading", e);
                     }
                     return output;
                 }))
         .apply(/* write to gcs */);

In the above examples I have used a getCsvLine method (created for code reusability) which takes a single line from the file and converts it into comma-separated format. To parse the JSON object I have used GSON.

 /**
  * @param line take each JSONL line
  * @param isHeader true: returns output combining the JSON keys || false:
  *                 returns output combining the JSON values
  **/
 public static String getCsvLine(String line, boolean isHeader) {
     List<String> values = new ArrayList<>();
     // convert the line into a json object
     JsonObject jsonObject = JsonParser.parseString(line).getAsJsonObject();
     // iterate the json object and collect all keys or values
     for (Map.Entry<String, JsonElement> entry : jsonObject.entrySet()) {
         if (isHeader)
             values.add(entry.getKey());
         else
             values.add(entry.getValue().getAsString());
     }
     String result = String.join(",", values);
     return result;
 }
