![](/img/trans.png)
[英]Is there a way to match avro schema with Bigquery and Bigtable?
[英]Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?
我正在嘗試將谷歌雲數據存儲數據導出到谷歌雲存儲中的 Avro 文件,然后將這些文件加載到 BigQuery 中。
首先,我知道 Big Query 會加載數據存儲備份。 這有幾個我想避免的缺點:
澄清了這個實驗的動機后,我的數據流管道將數據導出為 avro 格式:
package com.example.dataflow;
import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.ByteArrayOutputStream;
public class GCDSEntitiesToAvroSSCCEPipeline {
private static final String GCS_TARGET_URI = "gs://myBucket/datastore/dummy";
private static final String ENTITY_KIND = "Dummy";
private static Schema getSchema() {
return ProtobufData.get().getSchema(Entity.class);
}
private static final Logger LOG = LoggerFactory.getLogger(GCDSEntitiesToAvroSSCCEPipeline.class);
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
.addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));
p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
.withDataset(options.as(DataflowPipelineOptions.class).getProject())
.withQuery(q.build())))
.apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
.setCoder(AvroCoder.of(getSchema()))
.apply(AvroIO.Write.named("WriteToAvro")
.to(GCS_TARGET_URI)
.withSchema(getSchema())
.withSuffix(".avro"));
p.run();
}
private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
private static final long serialVersionUID = 1L;
@Override
public void processElement(ProcessContext c) throws Exception {
Schema schema = getSchema();
ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
ByteArrayOutputStream bos = new ByteArrayOutputStream();
dataFileWriter.create(schema, bos);
dataFileWriter.append(c.element());
dataFileWriter.close();
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
new SeekableByteArrayInput(bos.toByteArray()), datumReader);
c.output(dataFileReader.next());
}
}
}
管道運行良好,但是當我嘗試將生成的 Avro 文件加載到大查詢中時,出現以下錯誤:
bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1': The Apache Avro library failed to parse file
gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.
但是,如果我使用 avro 工具加載生成的 avro 文件,一切都很好:
avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"Dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
在運行數據流管道之前,我使用此代碼用虛擬數據填充數據存儲:
package com.example.datastore;
import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;
import java.io.IOException;
public static void main(String[] args) throws IOException {
Datastore datastore = DatastoreOptions.builder()
.projectId("myProjectId")
.authCredentials(AuthCredentials.createApplicationDefaults())
.build().service();
KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("Dummy");
Batch batch = datastore.newBatch();
int batchCount = 0;
for (int i = 0; i < 4000; i++){
IncompleteKey key = dummyKeyFactory.newKey();
System.out.println("adding entity " + i);
batch.add(Entity.builder(key).set("number", i).build());
batchCount++;
if (batchCount > 99) {
batch.submit();
batch = datastore.newBatch();
batchCount = 0;
}
}
System.out.println("done");
}
那么為什么 BigQuery 拒絕我的 avro 文件呢?
BigQuery 使用 C++ Avro 庫,顯然它不喜歡命名空間中的“$”。 這是錯誤消息:
命名空間無效:com.google.api.services.datastore.DatastoreV1$
我們正在努力將這些 Avro 錯誤消息發送給最終用戶。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.