
Avro to BigTable - Schema issue?

I am trying to ingest an Avro file (generated with Spark 3.0) into BigTable using the Dataflow template [1], and I get the error below.

Note: the file can be read with Spark and with the Python avro library without any apparent issue.

Any ideas?

Thanks for your support!

Error (short)

Caused by: org.apache.avro.AvroTypeException: Found topLevelRecord, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key

Avro schema (excerpt)

{"type":"record","name":"topLevelRecord","fields":[{"name":"a_a","type": ["string", "null"]}, ...]}

Error (full)

java.io.IOException: Failed to start reading from source: gs://myfolder/myfile.avro range [0, 15197631)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start (WorkerCustomSources.java:610)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start (ReadOperation.java:361)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop (ReadOperation.java:194)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start (ReadOperation.java:159)
at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute (MapTaskExecutor.java:77)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork (BatchDataflowWorker.java:417)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork (BatchDataflowWorker.java:386)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork (BatchDataflowWorker.java:311)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork (DataflowBatchWorkerHarness.java:140)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call (DataflowBatchWorkerHarness.java:120)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call (DataflowBatchWorkerHarness.java:107)
at java.util.concurrent.FutureTask.run (FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
at java.lang.Thread.run (Thread.java:834)
Caused by: org.apache.avro.AvroTypeException: Found topLevelRecord, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
at org.apache.avro.io.ResolvingDecoder.doAction (ResolvingDecoder.java:292)
at org.apache.avro.io.parsing.Parser.advance (Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readFieldOrder (ResolvingDecoder.java:130)
at org.apache.avro.generic.GenericDatumReader.readRecord (GenericDatumReader.java:215)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion (GenericDatumReader.java:175)
at org.apache.avro.generic.GenericDatumReader.read (GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.read (GenericDatumReader.java:145)
at org.apache.beam.sdk.io.AvroSource$AvroBlock.readNextRecord (AvroSource.java:644)
at org.apache.beam.sdk.io.BlockBasedSource$BlockBasedReader.readNextRecord (BlockBasedSource.java:210)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl (FileBasedSource.java:484)
at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl (FileBasedSource.java:479)
at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start (OffsetBasedSource.java:249)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start (WorkerCustomSources.java:607)

Reference:

[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#avrofiletocloudbigtable

BigTable is a scalable NoSQL database service, which means it is schemaless, whereas Spark SQL carries the schema you showed in your question.

Judging from the error below, it refers to the BigTable row key:

expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key

Therefore, you need to follow this process to create a BigTable schema design.
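The Avro file must instead match the template's BigtableRow schema: each record carries an explicit key plus an array of cells. For reference, a sketch of that schema as published in the GoogleCloudPlatform/DataflowTemplates repository (verify against the template's own bigtable.avsc):

{
  "type": "record",
  "name": "BigtableRow",
  "namespace": "com.google.cloud.teleport.bigtable",
  "fields": [
    {"name": "key", "type": "bytes"},
    {"name": "cells", "type": {"type": "array", "items": {
      "type": "record", "name": "BigtableCell", "fields": [
        {"name": "family", "type": "string"},
        {"name": "qualifier", "type": "bytes"},
        {"name": "timestamp", "type": "long", "logicalType": "timestamp-micros"},
        {"name": "value", "type": "bytes"}
      ]}}}
  ]
}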

Since HBase is also schemaless, if you have the flexibility to use Spark 2.4.0, you could address your use case with the Bigtable and HBase APIs, as sketched below.
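A minimal sketch of that route, writing one row through the Bigtable HBase client (assuming the bigtable-hbase client dependency; the project, instance, table, and column names are hypothetical placeholders):

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableHBaseWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical project/instance/table names -- replace with your own.
    try (Connection connection =
            BigtableConfiguration.connect("my-project", "my-instance");
        Table table = connection.getTable(TableName.valueOf("my-table"))) {
      // Unlike the Avro template, the row key is supplied explicitly here.
      Put put = new Put(Bytes.toBytes("row-key-1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a_a"), Bytes.toBytes("some value"));
      table.put(put);
    }
  }
}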

As for the use case above, it looks like a valid feature request; I will file it with the product team and update you with the report number.

