
The "Parquet Files on Cloud Storage to Cloud Bigtable" DataFlow template cannot read parquet files

I'm trying to move a parquet file, written out in R using the arrow library, to Bigtable. I have validated the arrow package installation and made sure that the snappy codec is available using codec_is_available("snappy").

For some reason, in the third step of the workflow, I run into the following error:

Error message from worker: java.lang.RuntimeException: 
org.apache.beam.sdk.util.UserCodeException: 
org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 1 in file 
ReadableFile{
  metadata=Metadata{
    resourceId=gs://mybucket/userdata_2.parquet, 
    sizeBytes=85550, 
    isReadSeekEfficient=true, 
    checksum=null, 
    lastModifiedMillis=0}, compression=UNCOMPRESSED} 

It is unclear to me why it gives this error, and also why it says compression=UNCOMPRESSED. The file has been compressed with snappy.

I have tried changing the arrow version from 1.0 to 2.0, and have tried changing compression codecs, including uncompressed (even though the uncompressed format does not seem to be supported by Google Dataflow). The error stays the same.

Using a utility like parquet-tools gives no indication that there is anything wrong with the files I'm uploading.

Is there any special requirement on the parquet format for Google Dataflow that I'm missing here? I've iterated through the ones available to me in the arrow package to no avail.

I was also seeing this error when trying to use my own pyarrow-generated parquets with the parquet_to_bigtable dataflow template.

The issue boiled down to schema mismatches. While the data in my parquet matched the expected format perfectly, and printing a known-good file alongside my own showed the exact same contents, parquet files also carry metadata that describes the schema, like so:

➜  ~ parq my_pyarrow_generated.parquet -s

 # Schema 
 <pyarrow._parquet.ParquetSchema object at 0x12d7164c0>
required group field_id=-1 schema {
  optional binary field_id=-1 key;
  optional group field_id=-1 cells (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 item {
        optional binary field_id=-1 family (String);
        optional binary field_id=-1 qualifier;
        optional double field_id=-1 timestamp;
        optional binary field_id=-1 value;
      }
    }
  }
}
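If you don't have the parq CLI handy, the same embedded schema can be inspected with pyarrow; a minimal sketch, where the file name is a placeholder for your own generated file:

import pyarrow.parquet as pq

# Print the physical parquet schema stored in the file footer
# (the same information parq -s shows).
print(pq.ParquetFile("my_pyarrow_generated.parquet").schema)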

I suspected this schema wasn't precisely what the template expects, so to see how far off I was, I used the inverse template, bigtable_to_parquet, to get a sample parquet file with the correct metadata encoded within it:

➜  ~ parq dataflow_bigtable_to_parquet.parquet -s                                                           

 # Schema 
 <pyarrow._parquet.ParquetSchema object at 0x1205c6a80>
required group field_id=-1 com.google.cloud.teleport.bigtable.BigtableRow {
  required binary field_id=-1 key;
  required group field_id=-1 cells (List) {
    repeated group field_id=-1 array {
      required binary field_id=-1 family (String);
      required binary field_id=-1 qualifier;
      required int64 field_id=-1 timestamp;
      required binary field_id=-1 value;
    }
  }
}
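For what it's worth, the two schemas can also be compared programmatically; a small sketch, assuming both files are available locally (file names are placeholders):

import pyarrow.parquet as pq

# Arrow-level schemas of my file and the template-generated reference file.
mine = pq.read_schema("my_pyarrow_generated.parquet")
reference = pq.read_schema("dataflow_bigtable_to_parquet.parquet")

# False here is the tell: the data looks identical, but the schemas do not match.
print(mine.equals(reference))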

As you can see, the schemas are very close but not exact: the field names, the nullability (optional vs. required), and the timestamp type (double vs. int64) all differ.

With this, though, we can build a simple workaround. It's gross, but I'm still actively debugging this right now and it's what finally worked.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# The sample file from the bigtable_to_parquet template carries the schema
# that parquet_to_bigtable expects.
bigtable_schema_parquet = pq.read_table(pa.BufferReader(bigtable_to_parquet_file_bytes))
keys = []
cells = []
# ... populate keys and cells with your own row keys and cell lists ...
df = pd.DataFrame({'key': keys, 'cells': cells})
table = pa.Table.from_pandas(df, schema=bigtable_schema_parquet.schema)
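To round out the sketch, the re-schema'd table can then be written back out with snappy before uploading to GCS; the output path is a placeholder:

# snappy is pyarrow's default codec, but passing it explicitly makes the intent clear.
pq.write_table(table, "userdata_fixed.parquet", compression="snappy")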

tl;dr: Use the bigtable_to_parquet dataflow template to get a sample parquet that has the schema the parquet_to_bigtable input must use. Then load that schema in memory and pass it to from_pandas to overwrite whatever schema it would otherwise have inferred.
