How to handle NULL Value in BigQuery while writing through Dataflow?

I am ingesting data from a database into BigQuery using the JdbcIO source connector and the BigQueryIO sink connector provided by Apache Beam.

Below is my sample table data:

[screenshot: sample table data, with NULL values in the id and booking_date columns]

As we can see, a few columns such as id and booking_date contain NULL values. So when I try to write the data into BigQuery, it gives the below error:

"message": "Error while reading data, error message: JSON parsing error in row starting at position 0: Only optional fields can be set to NULL. Field: status; Value: NULL 

If I pass null in booking_date, it gives an invalid date format error.
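
Both errors come down to what the BigQuery JSON loader accepts: a real JSON null only loads into a NULLABLE column, and a DATE column only accepts the yyyy-MM-dd string form. A small illustration (booking_date here stands in for a hypothetical DATE column):

// com.google.api.services.bigquery.model.TableRow
TableRow ok     = new TableRow().set("booking_date", null);         // loads as NULL, but only into a NULLABLE column
TableRow alsoOk = new TableRow().set("booking_date", "2023-01-15"); // DATE expects the yyyy-MM-dd string form
TableRow fails  = new TableRow().set("booking_date", "null");       // the string "null" is not a valid DATE value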

Below is the RowMapper I am using to convert the JdbcIO ResultSet into a TableRow. It is the same code that the GCP JdbcToBigQuery template uses (a sketch of how this mapper is wired into the pipeline follows the code).

public TableRow mapRow(ResultSet resultSet) throws Exception {
  ResultSetMetaData metaData = resultSet.getMetaData();
  TableRow outputTableRow = new TableRow();
  for (int i = 1; i <= metaData.getColumnCount(); i++) {
    if (resultSet.getObject(i) == null) {
      outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i));
      // outputTableRow.set(getColumnRef(metaData, i), String.valueOf(resultSet.getObject(i)));
      continue;
    }

    /*
     * DATE:      EPOCH MILLISECONDS -> yyyy-MM-dd
     * DATETIME:  EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSS
     * TIMESTAMP: EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSSXXX
     *
     * MySQL drivers have ColumnTypeName in all caps and postgres in small case
     */
    switch (metaData.getColumnTypeName(i).toLowerCase()) {
      case "date":
        outputTableRow.set(
            getColumnRef(metaData, i), dateFormatter.format(resultSet.getDate(i)));
        break;
      case "datetime":
        outputTableRow.set(
            getColumnRef(metaData, i),
            datetimeFormatter.format((TemporalAccessor) resultSet.getObject(i)));
        break;
      case "timestamp":
        outputTableRow.set(
            getColumnRef(metaData, i), timestampFormatter.format(resultSet.getTimestamp(i)));
        break;
      case "clob":
        Clob clobObject = resultSet.getClob(i);
        if (clobObject.length() > Integer.MAX_VALUE) {
          LOG.warn(
              "The Clob value size {} in column {} exceeds 2GB and will be truncated.",
              clobObject.length(),
              getColumnRef(metaData, i));
        }
        outputTableRow.set(
            getColumnRef(metaData, i), clobObject.getSubString(1, (int) clobObject.length()));
        break;
      default:
        outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i).toString());
    }
  }

  return outputTableRow;
}
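
For reference, here is a minimal sketch of how this mapper is wired between the JdbcIO source and the BigQueryIO sink. The driver, connection URL, query, and table spec below are placeholders, not my real values:

// org.apache.beam.sdk.Pipeline, org.apache.beam.sdk.io.jdbc.JdbcIO,
// org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO, TableRowJsonCoder}
Pipeline pipeline = Pipeline.create(options);

pipeline
    .apply("ReadFromJdbc",
        JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver", "jdbc:mysql://host:3306/db")) // placeholder
            .withQuery("SELECT * FROM bookings")                          // placeholder
            .withCoder(TableRowJsonCoder.of())
            .withRowMapper(this::mapRow)) // the mapRow method shown above; enclosing class must be serializable
    .apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.bookings") // placeholder
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

pipeline.run();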

See the JdbcToBigQuery template source for more details.

Solutions I tried without success:

  • I tried to skip the column entirely when its value is null, but then it gives the error Missing required field.
  • I tried to hard-code the value as the string "null" in all cases so that I could handle it later, but that gives the error Could not convert value 'string_value: \t \"null\"' to integer.

How can I handle all the NULL cases? Please note, I cannot simply drop these rows, since their other columns do contain values.

To solve your issue, you have to pass null when the date value is null, and you have to set the associated BigQuery columns to NULLABLE:

public TableRow mapRow(ResultSet resultSet) throws Exception {
  ResultSetMetaData metaData = resultSet.getMetaData();
  TableRow outputTableRow = new TableRow();
  for (int i = 1; i <= metaData.getColumnCount(); i++) {
    if (resultSet.getObject(i) == null) {
      outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i));
      continue;
    }

    /*
     * DATE:      EPOCH MILLISECONDS -> yyyy-MM-dd
     * DATETIME:  EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSS
     * TIMESTAMP: EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSSXXX
     *
     * MySQL drivers have ColumnTypeName in all caps and postgres in small case
     */
    switch (metaData.getColumnTypeName(i).toLowerCase()) {
      case "date":
        // Format only when a value is present; otherwise keep null.
        String date = Optional.ofNullable(resultSet.getDate(i))
            .map(d -> dateFormatter.format(d))
            .orElse(null);
        outputTableRow.set(getColumnRef(metaData, i), date);
        break;
      case "datetime":
        String datetime = Optional.ofNullable(resultSet.getObject(i))
            .map(d -> datetimeFormatter.format((TemporalAccessor) d))
            .orElse(null);
        outputTableRow.set(getColumnRef(metaData, i), datetime);
        break;
      case "timestamp":
        String timestamp = Optional.ofNullable(resultSet.getTimestamp(i))
            .map(t -> timestampFormatter.format(t))
            .orElse(null);
        outputTableRow.set(getColumnRef(metaData, i), timestamp);
        break;
      case "clob":
        Clob clobObject = resultSet.getClob(i);
        if (clobObject.length() > Integer.MAX_VALUE) {
          LOG.warn(
              "The Clob value size {} in column {} exceeds 2GB and will be truncated.",
              clobObject.length(),
              getColumnRef(metaData, i));
        }
        outputTableRow.set(
            getColumnRef(metaData, i), clobObject.getSubString(1, (int) clobObject.length()));
        break;
      default:
        outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i).toString());
    }
  }

  return outputTableRow;
}

For the date, datetime, and timestamp blocks, I applied the transformation only when the value is not null; otherwise I kept the default null value.
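
For the NULLABLE part, here is a minimal sketch of what the schema side could look like with the BigQuery model classes; the field names and types below merely mirror the sample table and are assumptions:

// com.google.api.services.bigquery.model.{TableSchema, TableFieldSchema}, java.util.Arrays
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("id").setType("INTEGER").setMode("NULLABLE"),
    new TableFieldSchema().setName("booking_date").setType("DATE").setMode("NULLABLE"),
    new TableFieldSchema().setName("status").setType("STRING").setMode("NULLABLE")));
// Note: if the table already exists, any column still marked REQUIRED must be
// relaxed to NULLABLE in the table definition itself before nulls will load.

With BigQueryIO you can pass this through .withSchema(schema) together with CREATE_IF_NEEDED; a pre-existing REQUIRED column will keep rejecting null values regardless of what the pipeline sends.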

