
Kafka Avro To BigQuery using Apache Beam in Java

Here is the scenario:

Kafka to BigQuery using Apache Beam, as an alternative to the Kafka Connect BigQuerySinkConnector [WePay].

I have been able to read Avro messages from a Kafka topic and print their contents to the console accurately. I am looking for help with writing these KafkaRecords to a BigQuery table.

        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline pipeline = Pipeline.create(options);

        // Customer is a class auto-generated from the Avro schema using the Avro Maven plugin in Eclipse
        // Read from the Kafka topic and get KafkaRecords
        @SuppressWarnings("unchecked")
        PTransform<PBegin, PCollection<KafkaRecord<String, Customer>>> input = KafkaIO.<String, Customer>read()
                .withBootstrapServers("http://server1:9092")
                .withTopic("test-avro")
                .withConsumerConfigUpdates(ImmutableMap.of("specific.avro.reader", (Object) "true"))
                .withConsumerConfigUpdates(ImmutableMap.of("auto.offset.reset", (Object) "earliest"))
                .withConsumerConfigUpdates(ImmutableMap.of("schema.registry.url", (Object) "http://server2:8181"))
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(Customer.class));
        
        
        // Print kafka records to console log
        
        pipeline.apply(input)
                .apply("ExtractRecord", ParDo.of(new DoFn<KafkaRecord<String,Customer>, KafkaRecord<String,Customer>>() { 
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        KafkaRecord<String, Customer> record = (KafkaRecord<String, Customer>) c.element();
                        KV<String, Customer> log = record.getKV();
                        System.out.println("Key Obtained: " + log.getKey());
                        System.out.println("Value Obtained: " + log.getValue().toString());
                        c.output(record);
                        
                    }
                }));
                
        // Write each record to the BigQuery table.
        // The table already exists in BigQuery, so the create disposition would be CREATE_NEVER.
        // Records are to be appended to the table, so the write disposition would be WRITE_APPEND.
        // All fields in the Customer object have corresponding column names and datatypes, so it is a one-to-one mapping.

        // The connection to BigQuery uses a service account JSON key file, set as an environment variable in the run configuration of the Eclipse project.
        // Set the table specification for BigQuery
        String bqTable = "my-project:my-dataset.my-table";

The examples currently available show how to set a schema manually and assign the values field by field. I am looking for an automated way to infer the schema from the Customer Avro object and map it to the table columns directly, without that manual field-by-field assignment; a sketch of the manual approach I want to avoid is below.
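A minimal sketch of that manual approach, for reference only. The field names (id, name, email) and getters are illustrative rather than the real Customer schema, and customers stands for a PCollection<Customer> extracted from the Kafka records:

        // Sketch of the manual mapping I want to avoid: a hand-written TableSchema plus
        // a field-by-field TableRow conversion. Field names are illustrative only.
        TableSchema manualSchema = new TableSchema().setFields(Arrays.asList(
                new TableFieldSchema().setName("id").setType("INTEGER"),
                new TableFieldSchema().setName("name").setType("STRING"),
                new TableFieldSchema().setName("email").setType("STRING")));

        // "customers" is assumed to be a PCollection<Customer> built from the KafkaRecords above
        customers.apply(BigQueryIO.<Customer>write()
                .to("my-project:my-dataset.my-table")
                .withSchema(manualSchema)
                .withFormatFunction(c -> new TableRow()
                        .set("id", c.getId())
                        .set("name", c.getName())
                        .set("email", c.getEmail()))
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));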

Is this possible?

After much trial and error, I was able to make the following work.

I would welcome review comments sharing concerns or proposing better solutions.

        // Fetch the latest Avro schema for the topic from the Schema Registry
        SchemaRegistryClient registryClient = new CachedSchemaRegistryClient("http://server2:8181", 10);
        SchemaMetadata latestSchemaMetadata;
        Schema avroSchema = null;
        try {
            // getLatestSchemaMetadata takes the subject name, which follows the "<topic>-value" convention,
            // so for the topic "test-avro" the subject is "test-avro-value"
            latestSchemaMetadata = registryClient.getLatestSchemaMetadata("test-avro-value");
            avroSchema = new Schema.Parser().parse(latestSchemaMetadata.getSchema());
        } catch (IOException e) {
            System.out.println("IO Exception while obtaining registry data");
            e.printStackTrace();
        } catch (RestClientException e) {
            System.out.println("Client Exception while obtaining registry data");
            e.printStackTrace();
        }

        // Print the Avro schema obtained
        System.out.println("---------------- Avro schema ----------- " + avroSchema.toString());
        
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline pipeline = Pipeline.create(options);
        
        
        // Read from Kafka Topic and get KafkaRecords
        // Create KafkaIO.Read with Avro schema deserializer
        @SuppressWarnings("unchecked")
        KafkaIO.Read<String, GenericRecord> read = KafkaIO.<String, GenericRecord>read()
            .withBootstrapServers("http://server1:9092")
            .withTopic(KafkaConfig.getInputTopic())
            .withConsumerConfigUpdates(ImmutableMap.of("schema.registry.url", "http://server2:8181"))
            .withConsumerConfigUpdates(ImmutableMap.of("specific.avro.reader", (Object)"true"))
            .withConsumerConfigUpdates(ImmutableMap.of("auto.offset.reset", (Object)"earliest"))
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(avroSchema));
        
        
        // Set Beam Schema
        
        org.apache.beam.sdk.schemas.Schema beamSchema = AvroUtils.toBeamSchema(avroSchema); 
        
        // Kafka records are printed to the console log inside the ExtractRecord step below.

        // Write each record to the BigQuery table.
        // The table already exists in BigQuery, so the create disposition is CREATE_NEVER.
        // Records are appended to the table, so the write disposition is WRITE_APPEND.
        // All fields in the Customer object have corresponding column names and datatypes, so it is a one-to-one mapping.

        // The connection to BigQuery uses a service account JSON key file, set as an environment variable in the run configuration of the Eclipse project.
        // Set the table specification for BigQuery
        String bqTable = "my-project:my-dataset.my-table";
        
        pipeline.apply(read)
                .apply("ExtractRecord", ParDo.of(new DoFn<KafkaRecord<String, GenericRecord>, KV<String, GenericRecord>>() {
                    private static final long serialVersionUID = 1L;

                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // Log each Kafka record to the console and emit its key/value pair
                        KafkaRecord<String, GenericRecord> record = c.element();
                        KV<String, GenericRecord> log = record.getKV();
                        System.out.println("Key Obtained: " + log.getKey());
                        System.out.println("Value Obtained: " + log.getValue().toString());
                        c.output(log);
                    }
                }))
                .apply(Values.<GenericRecord>create())
                .setSchema(beamSchema,
                        TypeDescriptor.of(GenericRecord.class),
                        AvroUtils.getToRowFunction(GenericRecord.class, avroSchema),
                        AvroUtils.getFromRowFunction(GenericRecord.class))
                .apply(BigQueryIO.<GenericRecord>write()
                        .to(bqTable)
                        .useBeamSchema()
                        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        pipeline.run().waitUntilFinish();

The above works with CREATE_IF_NEEDED also.
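For completeness, that variant only changes the create disposition; since useBeamSchema() derives the table schema from the Beam schema, BigQueryIO can create the table on first write. A sketch, where records stands for the PCollection<GenericRecord> that has the Beam schema set above:

        // Same write step with CREATE_IF_NEEDED: BigQueryIO creates the table from the
        // Beam schema (derived via useBeamSchema()) if it does not already exist.
        records.apply(BigQueryIO.<GenericRecord>write()
                .to(bqTable)
                .useBeamSchema()
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));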
