
Kafka to Google Cloud Platform Dataflow ingestion

What are the possible options for streaming, consuming, and ingesting Kafka data from topics into BigQuery / Cloud Storage?

Related to this: is it possible to use Kafka with Google Cloud Dataflow?

GCP provides Dataflow, which is built on top of the Apache Beam programming model. Is using KafkaIO with a Beam pipeline the recommended way to perform real-time transformations on the incoming data?

https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/kafka/KafkaIO.html

Kafka data can also be pushed to Cloud Pub/Sub and from there into a BigQuery table. A Kafka Streams or Spark job running outside of GCP is another option.

What factors should be considered in this design decision, given that the data is hosted entirely on Google Cloud Platform (GCP)?

Kafka support was added to Apache Beam in 2016, with the KafkaIO set of transformations. This means that Dataflow supports it as well.

The easiest way for you to load data into BigQuery would be an Apache Beam pipeline running on Dataflow. Your pipeline would look something like this:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.ParDo;

Pipeline p = Pipeline.create();

p.apply("ReadFromKafka", KafkaIO.<String, String>read()
                                .withTopic(myTopic)...)  // also set bootstrap servers, deserializers, etc.
 .apply("TransformData", ParDo.of(new FormatKafkaDataToBigQueryTableRow(mySchema)))
 .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                                     .to(myTableName)
                                     .withSchema(mySchema));

p.run().waitUntilFinish();

The advantage of using a Beam pipeline on Dataflow is that you would not have to manage offsets, state, or consistency of data reads (vs. a custom-written process that reads from Kafka into BigQuery), nor manage a cluster (vs. a Spark job).

Finally, here is an example of a pipeline using KafkaIO.

You can use Kafka Connect and the BigQuery or GCS connectors.
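For illustration, a minimal sink configuration for the community (WePay) BigQuery connector might look roughly like the following. The property names here are an assumption based on that connector and vary between connector versions, so treat this as a sketch and check the connector's documentation:

# Illustrative Kafka Connect sink config; property names/values are placeholders
name=bigquery-sink
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
tasks.max=1
topics=my_topic
# GCP-specific settings (assumed names; verify against your connector version)
project=my-gcp-project
defaultDataset=my_dataset
keyfile=/path/to/service-account-key.json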

In terms of transformation, you might be interested in KSQL (which is built on Kafka Streams); it is also covered in the same blog post.
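As a purely illustrative sketch (the stream, topic, and column names below are made up), a KSQL transformation over a Kafka topic might look like:

-- Declare a stream over an existing Kafka topic (schema assumed here)
CREATE STREAM pageviews (userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- Continuously write a transformed version of the stream to a new topic
CREATE STREAM pageviews_upper AS
  SELECT userid, UCASE(pageid) AS pageid
  FROM pageviews;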

Disclaimer: I work for Confluent and wrote some of the above material.

Another possible option is to use the Kafka Connect connector maintained by Google to forward data from Kafka to Pub/Sub. From Pub/Sub, you can easily use Dataflow to ingest into BigQuery or other Google services, as in the sketch below.
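As a rough sketch of that second hop, a Beam pipeline reading from Pub/Sub and writing to BigQuery could look like the following. The project, subscription, table, and single-column schema are placeholders, and the message payload is written as a raw string for simplicity:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

// Placeholder schema: one STRING column holding the raw message payload
TableSchema schema = new TableSchema().setFields(Collections.singletonList(
    new TableFieldSchema().setName("payload").setType("STRING")));

Pipeline p = Pipeline.create();

p.apply("ReadFromPubSub", PubsubIO.readStrings()
         .fromSubscription("projects/my-project/subscriptions/my-subscription"))
 .apply("ToTableRow", MapElements
         .into(TypeDescriptor.of(TableRow.class))
         .via((String msg) -> new TableRow().set("payload", msg)))
 .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")
         .withSchema(schema));

p.run().waitUntilFinish();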
