Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?

I'm trying to export Google Cloud Datastore data to Avro files in Google Cloud Storage and then load those files into BigQuery.

Firstly, I know that BigQuery can load Datastore backups directly. That approach has several disadvantages that I'd like to avoid:

With the motivation for this experiment clarified, here is my Dataflow pipeline to export the data to Avro format:

package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayOutputStream;

public class GCDSEntitiesToAvroSSCCEPipeline {

    private static final String GCS_TARGET_URI = "gs://myBucket/datastore/dummy";
    private static final String ENTITY_KIND = "Dummy";

    private static Schema getSchema() {
        return ProtobufData.get().getSchema(Entity.class);
    }

    private static final Logger LOG = LoggerFactory.getLogger(GCDSEntitiesToAvroSSCCEPipeline.class);
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
                .addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));

        // Read every entity of the given kind from Datastore, convert each one to a
        // GenericRecord, and write the records as Avro files to GCS.
        p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
                .withDataset(options.as(DataflowPipelineOptions.class).getProject())
                .withQuery(q.build())))
            .apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
            .setCoder(AvroCoder.of(getSchema()))
            .apply(AvroIO.Write.named("WriteToAvro")
                    .to(GCS_TARGET_URI)
                    .withSchema(getSchema())
                    .withSuffix(".avro"));
        p.run();

    }

    /**
     * Converts a Datastore Entity (a protobuf message) into an Avro GenericRecord
     * by round-tripping it through an in-memory Avro container file.
     */
    private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            Schema schema = getSchema();

            // Write the protobuf Entity into an in-memory Avro container file using the
            // schema derived from the protobuf descriptor.
            ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
            DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            dataFileWriter.create(schema, bos);
            dataFileWriter.append(c.element());
            dataFileWriter.close();

            // Read the bytes back as a GenericRecord so the downstream AvroIO sink can write it.
            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
            DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
                    new SeekableByteArrayInput(bos.toByteArray()), datumReader);

            c.output(dataFileReader.next());

        }
    }
}

The pipeline runs fine; however, when I try to load the resulting Avro file into BigQuery I get the following error:

bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE   
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1': The Apache Avro library failed to parse file
gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.

However, if I read the resulting Avro file with avro-tools, everything is just fine:

avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"Dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
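
For reference, the schema that BigQuery's parser actually sees is the one embedded in the container file itself. A minimal sketch for dumping it is below (the class name and file path are placeholders; point the path at one of the shards the pipeline produced):

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class DumpEmbeddedSchema {
    public static void main(String[] args) throws IOException {
        // Placeholder path; point it at one of the Avro shards written by the pipeline.
        File avroFile = new File("dummy-00000-of-00001.avro");
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            // Prints the embedded schema JSON, including the protobuf-derived
            // namespace "com.google.api.services.datastore.DatastoreV1$".
            System.out.println(reader.getSchema().toString(true));
        }
    }
}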

I used this code to populate Datastore with dummy data before running the Dataflow pipeline:

package com.example.datastore;

import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;

import java.io.IOException;

// A wrapping class is needed for this snippet to compile; the class name here is arbitrary.
public class PopulateDummyData {

    public static void main(String[] args) throws IOException {

        Datastore datastore = DatastoreOptions.builder()
                .projectId("myProjectId")
                .authCredentials(AuthCredentials.createApplicationDefaults())
                .build().service();

        KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("Dummy");

        // Write 4000 "Dummy" entities, submitting them in batches of 100.
        Batch batch = datastore.newBatch();
        int batchCount = 0;
        for (int i = 0; i < 4000; i++) {
            IncompleteKey key = dummyKeyFactory.newKey();
            System.out.println("adding entity " + i);
            batch.add(Entity.builder(key).set("number", i).build());
            batchCount++;
            if (batchCount > 99) {
                batch.submit();
                batch = datastore.newBatch();
                batchCount = 0;
            }
        }

        System.out.println("done");
    }
}

So why is BigQuery rejecting my Avro files?

BigQuery uses the C++ Avro library, and apparently it doesn't like the "$" in the namespace. Here's the error message:

Invalid namespace: com.google.api.services.datastore.DatastoreV1$

We're working on getting these Avro error messages out to the end user.
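
Until then, one possible workaround (a sketch under assumptions, not something the answer prescribes) is to rewrite the generated schema's JSON so the namespace no longer contains the "$" before handing it to the pipeline. The sketch below assumes that only the namespace string needs to change and that the record and field layout stay identical:

import com.google.api.services.datastore.DatastoreV1.Entity;

import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

public class SanitizedEntitySchema {

    // Strip the "$" that the protobuf-derived namespace contains, since BigQuery's
    // C++ Avro parser rejects it. Only the namespace string is rewritten; the record
    // and field layout are left untouched.
    static Schema getSanitizedSchema() {
        Schema original = ProtobufData.get().getSchema(Entity.class);
        String json = original.toString().replace(
                "com.google.api.services.datastore.DatastoreV1$",
                "com.google.api.services.datastore.DatastoreV1");
        return new Schema.Parser().parse(json);
    }
}

In the pipeline above this would mean passing getSanitizedSchema() instead of getSchema() to AvroCoder.of(...) and AvroIO.Write.withSchema(...), so the files written to GCS carry a namespace the C++ parser accepts; whether the intermediate GenericDatumReader inside the DoFn also needs the cleaned schema is something that would have to be verified.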
