
Google Dataflow: how to parse a big file with a valid JSON array from FileIO.ReadableFile


In my pipeline, the FileIO.readMatches() transform reads a big JSON file (around 300-400 MB) containing a valid JSON array and returns a FileIO.ReadableFile object to the next transform. My task is to read each JSON object from that array, add new properties, and output the result to the next transform.

At the moment my code to parse the JSON file looks like this:

        // file is a FileIO.ReadableFile object.
        // readFullyAsBytes() loads the entire file into memory at once.
        InputStream bis = new ByteArrayInputStream(file.readFullyAsBytes());
        // I'm using the Gson library to parse the JSON
        JsonReader reader = new JsonReader(new InputStreamReader(bis, StandardCharsets.UTF_8));
        JsonParser jsonParser = new JsonParser();
        reader.beginArray();
        while (reader.hasNext()) {
            JsonObject jsonObject = jsonParser.parse(reader).getAsJsonObject();
            jsonObject.addProperty("Somename", "Somedata");
            // processContext is a ProcessContext object
            processContext.output(jsonObject.toString());
        }
        reader.endArray();
        reader.close();

In this case the whole content of the file ends up in memory, which can lead to a java.lang.OutOfMemoryError. I'm searching for a way to read the JSON objects one by one without keeping the whole file in memory. A possible solution is the open() method of FileIO.ReadableFile, which returns a ReadableByteChannel, but I'm not sure how to use that channel to read exactly one JSON object at a time.
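To illustrate what "one object at a time" means, here is a minimal, self-contained sketch (not Beam or Gson code; the class name `StreamingArrayDemo` is made up for this example). It pulls one top-level JSON object at a time out of a character stream by tracking brace depth, so the full array never has to sit in memory at once. Gson's streaming JsonReader does this properly; this sketch only shows the idea and ignores edge cases such as malformed input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StreamingArrayDemo {

    // Scans the stream and returns each top-level {...} object as its own string.
    public static List<String> extractObjects(Reader reader) throws IOException {
        List<String> objects = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;            // nesting level of '{' ... '}'
        boolean inString = false; // inside a JSON string literal
        boolean escaped = false;  // previous char was a backslash
        int c;
        while ((c = reader.read()) != -1) {
            char ch = (char) c;
            if (depth > 0) {
                current.append(ch);
            }
            if (inString) {
                // Braces inside string literals must not affect the depth count.
                if (escaped) {
                    escaped = false;
                } else if (ch == '\\') {
                    escaped = true;
                } else if (ch == '"') {
                    inString = false;
                }
                continue;
            }
            if (ch == '"') {
                inString = true;
            } else if (ch == '{') {
                if (depth == 0) {
                    current.setLength(0);
                    current.append(ch);
                }
                depth++;
            } else if (ch == '}') {
                depth--;
                if (depth == 0) {
                    objects.add(current.toString());
                }
            }
        }
        return objects;
    }

    public static void main(String[] args) throws IOException {
        String array = "[{\"id\":1},{\"id\":2,\"name\":\"x{y}\"}]";
        for (String obj : extractObjects(new StringReader(array))) {
            System.out.println(obj);
        }
    }
}
```

In a real pipeline you would pass a Reader wrapping Channels.newInputStream(file.open()) instead of a StringReader, and emit each extracted object immediately rather than collecting them in a list.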

Updated solution: this is my updated solution, which reads the file line by line:

    ReadableByteChannel readableByteChannel = null;
    InputStream inputStream = null;
    BufferedReader bufferedReader = null;
    try {
        // file is a FileIO.ReadableFile 
        readableByteChannel = file.open();
        inputStream = Channels.newInputStream(readableByteChannel);
        bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            if (line.length() > 1) {
                // my final output should contain both filename and line
                processContext.output(fileName + line);
            }
        }
    } catch (IOException ex) {
        logger.error("Exception during reading the file: {}", ex);
    } finally {
        IOUtils.closeQuietly(bufferedReader);
        IOUtils.closeQuietly(inputStream);
    }

I found that this solution does not work with Dataflow running on an n1-standard-1 machine, where it throws a java.lang.OutOfMemoryError: GC overhead limit exceeded exception, but it works correctly on an n1-standard-2 machine. Note that if the whole JSON array sits on a single line, readLine() still buffers the entire file as one giant String, which may explain the error.

ReadableByteChannel is part of the Java NIO API, introduced in Java 1.4. Java provides a way to convert it to an InputStream: InputStream bis = Channels.newInputStream(file.open()); - I believe this is the only change you need to make.
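Putting the answer together with the question's original Gson loop, the fix might look like the following sketch (assumes Gson is on the classpath; the class and method names are made up, and a channel over an in-memory byte array stands in for file.open() so the snippet runs on its own):

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

public class StreamingGsonDemo {

    // Streams a JSON array from the channel, materializing one element at a time.
    public static int processArray(ReadableByteChannel channel) throws IOException {
        int count = 0;
        try (JsonReader reader = new JsonReader(
                new InputStreamReader(Channels.newInputStream(channel), StandardCharsets.UTF_8))) {
            reader.beginArray();
            while (reader.hasNext()) {
                // Only the current array element is held in memory here.
                JsonObject jsonObject = new JsonParser().parse(reader).getAsJsonObject();
                jsonObject.addProperty("Somename", "Somedata");
                // In the real pipeline: processContext.output(jsonObject.toString());
                count++;
            }
            reader.endArray();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] json = "[{\"id\":1},{\"id\":2}]".getBytes(StandardCharsets.UTF_8);
        // In the real pipeline this channel would come from file.open().
        ReadableByteChannel channel = Channels.newChannel(new ByteArrayInputStream(json));
        System.out.println(processArray(channel)); // prints 2
    }
}
```

The key difference from the original code is that readFullyAsBytes() is never called, so memory use is bounded by the size of a single array element rather than the whole file.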
