
Facing performance issue while reading files from GCS using Apache Beam

I was trying to read data using a wildcard from a GCS path. My files are in bzip2 format, and there are around 300k files in the GCS path matching the same wildcard expression. I'm using the code snippet below to read the files.

    PCollection<String> val = p
            // Match all files under the wildcard pattern (~300k bzip2 files)
            .apply(FileIO.match()
                    .filepattern("gcsPath"))
            // Read each matched file, decompressing with BZIP2
            .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
            // Read the full contents of each file into a single String element
            .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
                try {
                    return f.readFullyAsUTF8String();
                } catch (IOException e) {
                    return null;
                }
            }));

But the performance is very bad, and at the current speed it would take around 3 days to read those files with the above code. Is there any alternative API I can use in Cloud Dataflow to read this number of files from GCS with, of course, good performance? I used TextIO earlier, but it failed because of the template serialization limit, which is 20 MB.

The TextIO code below solved the issue.

    PCollection<String> input = p.apply("Read file from GCS",
            TextIO.read().from(options.getInputFile())
                    .withCompression(Compression.AUTO)
                    .withHintMatchesManyFiles());

withHintMatchesManyFiles() solved the issue. But I still don't know why the FileIO performance is so bad.
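
One plausible explanation (an assumption, not something verified here) is that with FileIO the match, read, and map steps get fused into a single stage, so the per-file reads for the ~300k matched files are not redistributed across workers the way TextIO can do when given the withHintMatchesManyFiles() hint. A minimal sketch of the same FileIO pipeline with a Reshuffle.viaRandomKey() inserted to break that fusion, assuming the same pipeline p, imports, and GCS pattern as the snippet above (whether this actually closes the performance gap is an assumption):

    PCollection<String> val = p
            .apply(FileIO.match().filepattern("gcsPath"))
            // Redistribute the matched file metadata across workers before reading,
            // so the expensive per-file reads are not fused into one stage with the match.
            // (Reshuffle is the usual fusion-breaking transform in the Beam Java SDK;
            // its benefit here is an assumption, not a measured result.)
            .apply(Reshuffle.viaRandomKey())
            .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
            .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
                try {
                    return f.readFullyAsUTF8String();
                } catch (IOException e) {
                    // Fail fast instead of emitting null, which the String coder cannot encode
                    throw new RuntimeException(e);
                }
            }));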
