I'm trying to build a Dataflow pipeline that de-identifies data from a BigQuery table. I'm building a com.google.privacy.dlp.v2.Table object and passing it to the ContentItem like this:
    List<Field> fieldList = new ArrayList<>(
        bigquery
            .getTable(table)
            .getDefinition()
            .getSchema()
            .getFields());
    List<Table.Row> rows = new ArrayList<>();
    for (FieldValueList bigQueryRowItem : bigquery
            .listTableData(table)
            .getValues()) {
        Table.Row row = convertBigQueryRowToTableRow(bigQueryRowItem);
        rows.add(row);
    }
    Table dlpTable = Table
        .newBuilder()
        .addAllHeaders(convertFieldsToHeaders(fieldList))
        .addAllRows(rows)
        .build();
Unfortunately, this fieldList contains only the first-level BigQuery fields, without the ones nested in RECORD or REPEATED fields. How can I efficiently get all of the table's field names, including those inside RECORD/REPEATED fields, and how can I efficiently convert the BigQuery values to Table.Row? Thank you.
At the moment, this is done in one of two ways:
Flatten the fields into columns. A record RecordA {Field1, Field2} becomes 2 columns, RecordA.Field1 and RecordA.Field2.
For repeated fields, you can do the same, or concatenate the field values together into a single cell. With flattening, RecordA { Field1: [value1, value2, value3] } becomes 3 columns: RecordA.Field1[0], RecordA.Field1[1], and RecordA.Field1[2].
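The flattening described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the DLP or BigQuery API: SchemaField is a hypothetical stand-in for com.google.cloud.bigquery.Field, nested Map and List values stand in for RECORD and REPEATED cells, and the names FlattenSketch, flattenHeaders, flattenValue, and joinRepeated are made up for this example. With the real client you would presumably recurse over Field.getSubFields() and read cells through FieldValue.getRecordValue() and FieldValue.getRepeatedValue() instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FlattenSketch {

    /** Hypothetical stand-in for a BigQuery schema field: a name plus optional nested fields. */
    public static final class SchemaField {
        final String name;
        final List<SchemaField> subFields; // empty for leaf (non-RECORD) fields

        public SchemaField(String name, SchemaField... subFields) {
            this.name = name;
            this.subFields = Arrays.asList(subFields);
        }
    }

    /** Recursively turns a nested schema into dotted column names such as RecordA.Field1. */
    public static List<String> flattenHeaders(List<SchemaField> fields, String prefix) {
        List<String> headers = new ArrayList<>();
        for (SchemaField field : fields) {
            String path = prefix.isEmpty() ? field.name : prefix + "." + field.name;
            if (field.subFields.isEmpty()) {
                headers.add(path);                                   // leaf: one column
            } else {
                headers.addAll(flattenHeaders(field.subFields, path)); // record: recurse
            }
        }
        return headers;
    }

    /** Flattens one value: a Map is treated as a record, a List as a repeated field. */
    public static void flattenValue(String path, Object value, Map<String, String> out) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                String child = path.isEmpty() ? String.valueOf(entry.getKey())
                                              : path + "." + entry.getKey();
                flattenValue(child, entry.getValue(), out);
            }
        } else if (value instanceof List) {
            List<?> items = (List<?>) value;
            for (int i = 0; i < items.size(); i++) {
                flattenValue(path + "[" + i + "]", items.get(i), out); // one column per element
            }
        } else {
            out.put(path, value == null ? "" : value.toString());
        }
    }

    /** Alternative for repeated scalar fields: concatenate all values into a single cell. */
    public static String joinRepeated(List<?> values, String separator) {
        List<String> parts = new ArrayList<>();
        for (Object value : values) {
            parts.add(value == null ? "" : value.toString());
        }
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        List<SchemaField> schema = Arrays.asList(
            new SchemaField("id"),
            new SchemaField("RecordA", new SchemaField("Field1"), new SchemaField("Field2")));
        System.out.println(flattenHeaders(schema, "")); // [id, RecordA.Field1, RecordA.Field2]

        Map<String, Object> recordA = new LinkedHashMap<>();
        recordA.put("Field1", Arrays.asList("value1", "value2", "value3"));
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("RecordA", recordA);

        Map<String, String> cells = new LinkedHashMap<>();
        flattenValue("", row, cells);
        System.out.println(cells.keySet()); // [RecordA.Field1[0], RecordA.Field1[1], RecordA.Field1[2]]
    }
}
```

One thing to keep in mind with the indexed form: the number of columns depends on how many values each row holds, so the header list would likely have to be the union over all rows, with shorter rows padded. Concatenating into a single cell, as joinRepeated does, keeps the column count fixed.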