
Store Kafka data in HDFS in Parquet format using Flink?

I am trying to store Kafka data in HDFS in Parquet format using Flink. I followed the Flink documentation, but it is not working.

I cannot find any proper documentation on how to store the data as a Parquet file.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.enableCheckpointing(100);

final List<Datum> data = Arrays.asList(new Datum("a", 1), new Datum("b", 2), new Datum("c", 3));

DataStream<Datum> stream = env.addSource(new FiniteTestSource<>(data), TypeInformation.of(Datum.class));


stream.addSink(
    StreamingFileSink.forBulkFormat(
        Path.fromLocalFile(new File("path")),
        ParquetAvroWriters.forReflectRecord(String.class))
        .build());
env.execute();

I have created a serializable class:

public static class Datum implements Serializable {

        public String a;
        public int b;

        public Datum() {
        }

        public Datum(String a, int b) {
            this.a = a;
            this.b = b;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o == null || getClass() != o.getClass()) {
                return false;
            }

            Datum datum = (Datum) o;
            return b == datum.b && (a != null ? a.equals(datum.a) : datum.a == null);
        }

        @Override
        public int hashCode() {
            int result = a != null ? a.hashCode() : 0;
            result = 31 * result + b;
            return result;
        }
    }

The above code does not write any data to the files; it just keeps creating many files.

Can anyone help with proper documentation or code?

As stated in the documentation of StreamingFileSink:

IMPORTANT: Checkpointing needs to be enabled when using the StreamingFileSink. Part files can only be finalized on successful checkpoints. If checkpointing is disabled part files will forever stay in in-progress or pending state and cannot be safely read by downstream systems.
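
In practice this means that until a checkpoint completes you will only see in-progress part files (with the sink's default naming, hidden files that look roughly like .part-0-0.inprogress.<uuid>); they are renamed to their final part-<subtask>-<count> names only when a checkpoint succeeds.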

To enable checkpointing, just use:

env.enableCheckpointing(1000);

You have quite a few options to tweak it.
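
For example, here is a minimal sketch of some of those knobs via CheckpointConfig (the interval, pause, and timeout values below are arbitrary placeholders, not recommendations):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Checkpoint every second with exactly-once guarantees.
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);

CheckpointConfig config = env.getCheckpointConfig();
// Leave at least 500 ms between the end of one checkpoint and the start of the next.
config.setMinPauseBetweenCheckpoints(500);
// Give up on a checkpoint that takes longer than one minute.
config.setCheckpointTimeout(60000);
// Allow only one checkpoint to be in flight at a time.
config.setMaxConcurrentCheckpoints(1);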


Here is a complete example (imports added for completeness; Address is an Avro-generated SpecificRecord class, and the output directory is an example path):

import java.io.File;
import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.util.FiniteTestSource;

// Address is generated from an Avro schema; forSpecificRecord requires
// an Avro SpecificRecord class, not a plain POJO.
final List<Address> data = Arrays.asList(
    new Address(1, "a", "b", "c", "12345"),
    new Address(2, "p", "q", "r", "12345"),
    new Address(3, "x", "y", "z", "12345")
);

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// Part files are only finalized on checkpoints, so this must be enabled.
env.enableCheckpointing(100);

// FiniteTestSource is a test utility that emits the list and finishes
// after checkpoints have completed.
DataStream<Address> stream = env.addSource(
    new FiniteTestSource<>(data), TypeInformation.of(Address.class));

// Output directory for the part files (example path).
final File folder = new File("/tmp/parquet-out");

stream.addSink(
    StreamingFileSink.forBulkFormat(
        Path.fromLocalFile(folder),
        ParquetAvroWriters.forSpecificRecord(Address.class))
        .build());

env.execute();
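
And since the question is specifically about Kafka, here is a minimal sketch of the same sink fed from a Kafka topic instead of the test source, reusing the Datum POJO from the question. The topic name, bootstrap servers, HDFS path, and the comma-separated parsing are made-up placeholders; adapt the deserialization to the real message format:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Required: the sink can only finalize part files on successful checkpoints.
env.enableCheckpointing(60000);

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
props.setProperty("group.id", "parquet-writer");          // placeholder

DataStream<Datum> stream = env
    .addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
    // Placeholder parsing: assumes messages look like "a,1".
    .map(value -> {
        String[] parts = value.split(",");
        return new Datum(parts[0], Integer.parseInt(parts[1]));
    })
    // Help Flink's type extraction with the lambda's return type.
    .returns(Datum.class);

// Note the writer class matches the stream's element type (Datum);
// forReflectRecord uses Avro reflection on the plain POJO.
stream.addSink(
    StreamingFileSink.forBulkFormat(
        new Path("hdfs:///data/parquet-out"), // placeholder HDFS path
        ParquetAvroWriters.forReflectRecord(Datum.class))
        .build());

env.execute("kafka-to-parquet");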
