Store Kafka data in HDFS as Parquet format using Flink?

I am trying to store Kafka data in HDFS in Parquet format using Flink, following the Flink documentation, but it is not working.

I could not find any proper documentation on how to store it as a Parquet file.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(100);

final List<Datum> data = Arrays.asList(new Datum("a", 1), new Datum("b", 2), new Datum("c", 3));

DataStream<Datum> stream = env.addSource(new FiniteTestSource<>(data), TypeInformation.of(Datum.class));


stream.addSink(
    StreamingFileSink.forBulkFormat(
        Path.fromLocalFile(new File("path")),
        ParquetAvroWriters.forReflectRecord(String.class))
        .build());
env.execute();

I have created a serializable class:

public static class Datum implements Serializable {

        public String a;
        public int b;

        public Datum() {
        }

        public Datum(String a, int b) {
            this.a = a;
            this.b = b;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o == null || getClass() != o.getClass()) {
                return false;
            }

            Datum datum = (Datum) o;
            return b == datum.b && (a != null ? a.equals(datum.a) : datum.a == null);
        }

        @Override
        public int hashCode() {
            int result = a != null ? a.hashCode() : 0;
            result = 31 * result + b;
            return result;
        }
    }

The above code does not write any data to the files; it just keeps creating many files.

Can anyone help with proper documentation or code?

As written in the documentation of StreamingFileSink:

IMPORTANT: Checkpointing needs to be enabled when using the StreamingFileSink. Part files can only be finalized on successful checkpoints. If checkpointing is disabled, part files will forever stay in the in-progress or pending state and cannot be safely read by downstream systems.

To enable it, just use:

env.enableCheckpointing(1000);

You have quite a few options to tweak it.
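
For example, here is a minimal sketch of the most commonly tuned checkpoint settings (the interval, pause, and timeout values below are arbitrary placeholders, not recommendations):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Trigger a checkpoint every second; StreamingFileSink finalizes part files on each successful checkpoint.
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);

CheckpointConfig config = env.getCheckpointConfig();
config.setMinPauseBetweenCheckpoints(500); // minimum pause between two checkpoints
config.setCheckpointTimeout(60000);        // abort a checkpoint if it takes longer than a minute
config.setMaxConcurrentCheckpoints(1);     // at most one checkpoint in flight at a time

Shorter checkpoint intervals commit Parquet part files more often (more, smaller files); longer intervals produce fewer, larger files.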


Here is a complete example:

final List<Address> data = Arrays.asList(
    new Address(1, "a", "b", "c", "12345"),
    new Address(2, "p", "q", "r", "12345"),
    new Address(3, "x", "y", "z", "12345")
);

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(100);

// FiniteTestSource is a Flink test utility: it emits the given elements and then waits
// for checkpoints to complete, so the sink can finalize its part files before the job ends.
DataStream<Address> stream = env.addSource(
    new FiniteTestSource<>(data), TypeInformation.of(Address.class));

// Address is an Avro-generated SpecificRecord, so the specific-record writer can be used;
// "folder" is the target output directory for the Parquet part files.
stream.addSink(
    StreamingFileSink.forBulkFormat(
        Path.fromLocalFile(folder),
        ParquetAvroWriters.forSpecificRecord(Address.class))
        .build());

env.execute();
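
The complete example above uses FiniteTestSource and a local folder just to demonstrate the sink. To actually read from Kafka and write Parquet to HDFS, a sketch along the following lines should work; the topic name, bootstrap servers, HDFS URI, and the parseDatum helper are hypothetical placeholders, and forReflectRecord is used because Datum is a plain POJO rather than an Avro-generated class:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60000); // part files are committed once per successful checkpoint

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka-broker:9092"); // hypothetical broker address
props.setProperty("group.id", "parquet-writer");

// Consume raw strings from Kafka ("input-topic" is a placeholder).
DataStream<String> raw = env.addSource(
    new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

// parseDatum is a hypothetical helper that turns one Kafka record into a Datum POJO.
DataStream<Datum> records = raw
    .map(value -> parseDatum(value))
    .returns(Datum.class); // help Flink's type extraction for the lambda

// Bulk-encode Datum objects as Parquet via Avro reflection and write them to HDFS.
records.addSink(
    StreamingFileSink.forBulkFormat(
        new Path("hdfs://namenode:8020/data/output"), // hypothetical HDFS path
        ParquetAvroWriters.forReflectRecord(Datum.class))
        .build());

env.execute("kafka-to-parquet");

Note that the Parquet/Avro format, the Kafka connector, and the Hadoop filesystem dependencies need to be on the classpath for such a job to run.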
