How to de-identify BigQuery data that is stored in RECORD or REPEATED fields?

I'm trying to build a Dataflow pipeline that de-identifies data from a BigQuery table. I'm building a com.google.privacy.dlp.v2.Table object and passing it to the ContentItem like this:

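    import com.google.cloud.bigquery.Field;
    import com.google.cloud.bigquery.FieldValueList;
    import com.google.privacy.dlp.v2.Table;
    import java.util.ArrayList;
    import java.util.List;

    // Top-level schema fields of the source table.
    List<Field> fieldList = new ArrayList<>(
            bigquery
                    .getTable(table)
                    .getDefinition()
                    .getSchema()
                    .getFields());

    // Convert every BigQuery row to a DLP Table.Row.
    // iterateAll() walks all result pages; getValues() returns only the current page.
    List<Table.Row> rows = new ArrayList<>();
    for (FieldValueList bigQueryRowItem : bigquery
            .listTableData(table)
            .iterateAll()) {
        rows.add(convertBigQueryRowToTableRow(bigQueryRowItem));
    }

    // Assemble the DLP table from the headers and the converted rows.
    Table dlpTable = Table
            .newBuilder()
            .addAllHeaders(convertFieldsToHeaders(fieldList))
            .addAllRows(rows)
            .build();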

Unfortunately, this fieldList contains only the top-level BigQuery fields, not the fields nested inside RECORD or REPEATED ones. How can I efficiently get all field names from the table, including those nested in RECORD/REPEATED fields, and how can I efficiently convert the BigQuery values to Table.Row? Thank you.

At the moment this is done in one of two ways:

Flatten the fields into columns. So a record

  RecordA {Field1, Field2} becomes 2 columns, RecordA.Field1 and RecordA.Field2
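The record-flattening step above can be sketched roughly as follows. This is only an illustration under stated assumptions: SimpleField and SchemaFlattener are hypothetical stand-ins, not part of the BigQuery or DLP libraries. In the real client library, com.google.cloud.bigquery.Field exposes nested fields through getSubFields(), so the same recursion should carry over to it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a schema field: a name plus optional nested fields. */
final class SimpleField {
    final String name;
    final List<SimpleField> subFields; // empty for leaf (non-RECORD) fields

    SimpleField(String name, SimpleField... subFields) {
        this.name = name;
        this.subFields = List.of(subFields);
    }
}

/** Hypothetical helper that turns a nested schema into dotted column names. */
final class SchemaFlattener {
    private SchemaFlattener() {}

    /** Returns one dotted name per leaf field, in depth-first order. */
    static List<String> flatten(List<SimpleField> fields) {
        List<String> out = new ArrayList<>();
        collect("", fields, out);
        return out;
    }

    private static void collect(String prefix, List<SimpleField> fields, List<String> out) {
        for (SimpleField f : fields) {
            String path = prefix.isEmpty() ? f.name : prefix + "." + f.name;
            if (f.subFields.isEmpty()) {
                out.add(path);                 // leaf field: becomes a column
            } else {
                collect(path, f.subFields, out); // RECORD: descend into its children
            }
        }
    }
}
```

Calling SchemaFlattener.flatten on a schema with a top-level id field and a RecordA record holding Field1 and Field2 yields the names id, RecordA.Field1 and RecordA.Field2, which could then be mapped to DLP header FieldIds.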

For repeated fields you can do the same, or concatenate the field values together into a single cell.

  RecordA { Field1: [value1, value2, value3] } becomes 3 columns: RecordA.Field1[0], RecordA.Field1[1], and RecordA.Field1[2]
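Both options for repeated fields (one column per index, or one concatenated cell) can be sketched as below. This is a simplified illustration, not the library's API: RowFlattener is a hypothetical helper, and a row is modelled with plain Java collections (Map for a record, List for a repeated field) rather than FieldValueList. Note that index expansion produces a different set of columns for rows with different array lengths, so a real pipeline would probably need to pad rows to a common header set before building the DLP table.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that flattens one nested row into column -> cell pairs. */
final class RowFlattener {
    private RowFlattener() {}

    /**
     * @param row          nested row: Map = record, List = repeated field, anything else = scalar
     * @param joinRepeated true to concatenate repeated scalars into one cell,
     *                     false to emit one column per index
     */
    static Map<String, String> flatten(Map<String, ?> row, boolean joinRepeated) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, ?> e : row.entrySet()) {
            walk(e.getKey(), e.getValue(), joinRepeated, cells);
        }
        return cells;
    }

    private static void walk(String path, Object value, boolean joinRepeated,
                             Map<String, String> cells) {
        if (value instanceof Map) {
            // Record: append each child name to the dotted path.
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                walk(path + "." + e.getKey(), e.getValue(), joinRepeated, cells);
            }
        } else if (value instanceof List) {
            List<?> items = (List<?>) value;
            if (joinRepeated && allScalars(items)) {
                // Option 2: concatenate the values into a single cell.
                StringJoiner joined = new StringJoiner(",");
                for (Object item : items) {
                    joined.add(String.valueOf(item));
                }
                cells.put(path, joined.toString());
            } else {
                // Option 1: one column per element, e.g. RecordA.Field1[0].
                for (int i = 0; i < items.size(); i++) {
                    walk(path + "[" + i + "]", items.get(i), joinRepeated, cells);
                }
            }
        } else {
            cells.put(path, value == null ? "" : value.toString());
        }
    }

    private static boolean allScalars(List<?> items) {
        for (Object item : items) {
            if (item instanceof Map || item instanceof List) {
                return false;
            }
        }
        return true;
    }
}
```

Concatenation keeps the column set stable across rows, at the cost of losing the boundaries between individual values if a value itself contains the separator; repeated records are always expanded by index in this sketch.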
