
How to write to BigQuery with BigQuery IO in Apache Beam?

I'm trying to set up an Apache Beam pipeline that reads from Kafka and writes to BigQuery. I'm using the logic from here to filter out some coordinates: https://www.talend.com/blog/2018/08/07/developing-data-processing-job-using-apache-beam-streaming-pipeline/ TL;DR: the messages in the topic are of the format id,x,y; filter out all messages where x>100 or y>100.
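For context, `FilterObjectsByCoordinates` is the predicate from that post, used with `Filter.by` below; a rough sketch of it (the post's exact parsing may differ) is:

    // Rough sketch of the predicate from the linked post; exact details may differ.
    // Uses org.apache.beam.sdk.transforms.SerializableFunction.
    public static class FilterObjectsByCoordinates
            implements SerializableFunction<String, Boolean> {

        private final Integer maxCoordX;
        private final Integer maxCoordY;

        public FilterObjectsByCoordinates(Integer maxCoordX, Integer maxCoordY) {
            this.maxCoordX = maxCoordX;
            this.maxCoordY = maxCoordY;
        }

        @Override
        public Boolean apply(String input) {
            // Messages look like "id,x,y"; keep only records within bounds,
            // i.e. drop anything with x > maxCoordX or y > maxCoordY.
            String[] fields = input.split(",");
            return Integer.parseInt(fields[1].trim()) <= maxCoordX
                    && Integer.parseInt(fields[2].trim()) <= maxCoordY;
        }
    }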

I read the data, do a couple of transforms, then define my table schema, and then try to write to BigQuery. I'm not exactly sure how to call the write method; maybe it's a lack of Java generics knowledge. I believe it should be a PCollection, but I can't quite figure it out.

Here is the pipeline code - apologies if it's considered a code dump, I just want to give the whole context:

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply(
                KafkaIO.<Long, String>read()
                        .withBootstrapServers(options.getBootstrap())
                        .withTopic(options.getInputTopic())
                        .withKeyDeserializer(LongDeserializer.class)
                        .withValueDeserializer(StringDeserializer.class))
        .apply(
                ParDo.of(
                        new DoFn<KafkaRecord<Long, String>, String>() {
                            @ProcessElement
                            public void processElement(ProcessContext processContext) {
                                KafkaRecord<Long, String> record = processContext.element();
                                processContext.output(record.getKV().getValue());
                            }
                        }))
        .apply(
                "FilterValidCoords",
                Filter.by(new FilterObjectsByCoordinates(options.getCoordX(), options.getCoordY())))
        .apply(
                "ExtractPayload",
                ParDo.of(
                        new DoFn<String, KV<String, String>>() {
                            @ProcessElement
                            public void processElement(ProcessContext c) throws Exception {
                                c.output(KV.of("filtered", c.element()));
                            }
                        }));

    TableSchema tableSchema =
            new TableSchema()
                    .setFields(
                            ImmutableList.of(
                                    new TableFieldSchema()
                                            .setName("x_cord")
                                            .setType("STRING")
                                            .setMode("NULLABLE"),
                                    new TableFieldSchema()
                                            .setName("y_cord")
                                            .setType("STRING")
                                            .setMode("NULLABLE")));

    pipeline
            .apply(
                    "Write data to BQ",
                    BigQueryIO
                            .<String, KV<String, String>>write() // I'm not sure how to call this method
                            .optimizedWrites()
                            .withSchema(tableSchema)
                            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                            .withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
                            .withMethod(FILE_LOADS)
                            .to(new TableReference()
                                    .setProjectId("prod-analytics-264419")
                                    .setDatasetId("publsher")
                                    .setTableId("beam_load_test")));

You want something like this:

[..] 
pipeline.apply(BigQueryIO.writeTableRows()
        .to(String.format("%s.dataset.table", options.getProject()))
        .withCreateDisposition(CREATE_IF_NEEDED)
        .withWriteDisposition(WRITE_APPEND)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withSchema(getTableSchema()));
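Note that `writeTableRows()` expects a `PCollection<TableRow>`, so the `KV<String, String>` elements coming out of the "ExtractPayload" step need to be mapped into `TableRow`s first. Here is a minimal sketch continuing the original pipeline, reusing the `tableSchema` defined above; the field parsing assumes the id,x,y message format:

    [..]
        .apply(
                "ToTableRow",
                ParDo.of(
                        new DoFn<KV<String, String>, TableRow>() {
                            @ProcessElement
                            public void processElement(ProcessContext c) {
                                // Assumes the payload value still looks like "id,x,y".
                                String[] fields = c.element().getValue().split(",");
                                c.output(new TableRow()
                                        .set("x_cord", fields[1].trim())
                                        .set("y_cord", fields[2].trim()));
                            }
                        }))
        .apply(
                "Write data to BQ",
                BigQueryIO.writeTableRows()
                        .withSchema(tableSchema) // the schema defined earlier
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                        // "project:dataset.table" string form of the TableReference above
                        .to("prod-analytics-264419:publsher.beam_load_test"));

This also resolves the generics confusion: `BigQueryIO.write()` takes a single type parameter (the element type) and pairs it with `withFormatFunction` to map each element to a `TableRow`, while `writeTableRows()` is the convenience form that writes `TableRow` elements directly.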
