
Reading Avro files in GCS as PCollection<GenericRecord>

Our Dataflow job, written in Python, listens to a Pub/Sub subscription. Each message is the GCS file path of an Avro file ( gs://bucket/file-timestamp.avro ). The Avro files do not share a uniform schema, but Beam Python's avroio.ReadAllFromAvro is good enough to generically parse the records of each Avro file in GCS into dictionaries.

# Python code
pipeline
    | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(subscription=input_subscription)
    | "Decode" >> beam.Map(lambda message: message.decode("utf-8"))  # Pub/Sub payloads arrive as bytes
    | "Read Avro Files" >> avroio.ReadAllFromAvro(with_filename=True)

Each element of the resulting PCollection is a tuple: (gcs_filepath, record_as_dict)

Because of some limitations in Python, we need to port the pipeline to Java. Reading from Pub/Sub works fine, but we have not yet figured out how to use AvroIO to obtain a resulting PCollection<GenericRecord> without specifying a schema.

// Java code
pipeline
  .apply("Read from Pub/Sub", PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
  .apply("Read Avro Files", AvroIO.__________________)

Ideally, the original gcs_filepath would be passed along with each record, just as in the Python version. Do I need to use FileIO in tandem with AvroIO? How do I achieve this?

Update:

pipeline
  .apply("Read from Pub/Sub", PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
  .apply("Match Files", FileIO.matchAll())
  .apply("Read Matches", FileIO.readMatches())
  .apply("Parse Avro", AvroIO.parseFilesGenericRecords(record -> record))

The code above led me to some progress, but the pipeline then complains that I should provide a coder:

java.lang.IllegalArgumentException: Unable to infer coder for output of parseFn. Specify it explicitly using withCoder()

so the last line should look like this:

  .apply(AvroIO.parseFilesGenericRecords(record -> record).withCoder(someCoder))

But someCoder requires a specific schema, which I can't really provide because the incoming Avro files can have different schemas.
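For context on why this is a dead end: the natural choice for someCoder would be Beam's AvroCoder, but an AvroCoder is built from a concrete Avro Schema object, which must be fully specified at pipeline construction time. A minimal pure-Avro sketch (no Beam; the Event schema here is hypothetical) of what constructing such a schema requires:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class CoderSchemaSketch {

    // Parsing a Schema requires the complete field list up front --
    // exactly the information that varies between the incoming files.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"}]}";

    public static Schema buildSchema() {
        return new Schema.Parser().parse(SCHEMA_JSON);
    }

    public static void main(String[] args) {
        Schema schema = buildSchema();
        // In Beam, this Schema is what AvroCoder.of(schema) would consume.
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", "abc-123");
        System.out.println(schema.getFullName());
    }
}
```

Since no single such Schema covers all incoming files, a fixed AvroCoder cannot be supplied here.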

To read a PCollection of filepatterns whose schema is unknown at pipeline construction time or differs between files, you can use parseAllGenericRecords(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.avro.generic.GenericRecord, T>)

In your case, your serializable function can simply return the GenericRecord unchanged.

For example:


 Pipeline p = ...;

 PCollection<String> filepatterns = p.apply(...);
 PCollection<GenericRecord> records =
     filepatterns.apply(AvroIO.parseAllGenericRecords(
         new SerializableFunction<GenericRecord, GenericRecord>() {
           @Override
           public GenericRecord apply(GenericRecord record) {
             // If needed, access the schema of the record using record.getSchema()
             return record;
           }
         }));
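The reason no schema needs to be supplied here is that every Avro container file embeds its writer schema in the file header, so a reader can recover both the schema and the records from the bytes alone. A minimal pure-Avro round trip (no Beam; the Event schema and class names are illustrative) demonstrating this:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SelfDescribingAvro {

    public static byte[] writeOneRecord() throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", "abc-123");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, out);  // the writer schema is stored in the file header
            writer.append(record);
        }
        return out.toByteArray();
    }

    public static GenericRecord readWithoutSchema(byte[] bytes) throws Exception {
        // No schema is passed to the reader: it is taken from the file itself,
        // which is what lets parseAllGenericRecords handle mixed-schema files.
        try (DataFileStream<GenericRecord> reader = new DataFileStream<>(
                 new ByteArrayInputStream(bytes), new GenericDatumReader<>())) {
            return reader.next();
        }
    }

    public static void main(String[] args) throws Exception {
        GenericRecord record = readWithoutSchema(writeOneRecord());
        System.out.println(record.getSchema().getName());
    }
}
```

Note that Avro reads string fields back as org.apache.avro.util.Utf8, so compare them via toString().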
