How to handle NULL Value in BigQuery while writing through Dataflow?

I am ingesting data from a database into BigQuery using the JdbcIO source connector and the BigQueryIO sink connector provided by Apache Beam.

Below is my sample table data:

[screenshot: sample table data, with NULL values in the id and booking_date columns]

As we can see, a few columns such as id and booking_date contain NULL values. So when I try to write the data into BigQuery, it gives the below error:

"message": "Error while reading data, error message: JSON parsing error in row starting at position 0: Only optional fields can be set to NULL. Field: status; Value: NULL 

If I pass null in booking_date, it gives an invalid date format error.
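
Both errors come down to what the BigQuery JSON loader accepts: a real JSON null only loads into a NULLABLE column, and a DATE column only accepts the yyyy-MM-dd string form. A small illustration (booking_date here stands in for a hypothetical DATE column):

// com.google.api.services.bigquery.model.TableRow
TableRow ok     = new TableRow().set("booking_date", null);         // loads as NULL, but only into a NULLABLE column
TableRow alsoOk = new TableRow().set("booking_date", "2023-01-15"); // DATE expects the yyyy-MM-dd string form
TableRow fails  = new TableRow().set("booking_date", "null");       // the string "null" is not a valid DATE value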

Below is the RowMapper I am using to convert the JdbcIO ResultSet into a TableRow. It is the same code that the GCP JdbcToBigQuery template uses (a sketch of how this mapper is wired into the pipeline follows the code).

public TableRow mapRow(ResultSet resultSet) throws Exception {
  ResultSetMetaData metaData = resultSet.getMetaData();
  TableRow outputTableRow = new TableRow();
  for (int i = 1; i <= metaData.getColumnCount(); i++) {
    if (resultSet.getObject(i) == null) {
      outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i));
      // outputTableRow.set(getColumnRef(metaData, i), String.valueOf(resultSet.getObject(i)));
      continue;
    }

    /*
     * DATE:      EPOCH MILLISECONDS -> yyyy-MM-dd
     * DATETIME:  EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSS
     * TIMESTAMP: EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSSXXX
     *
     * MySQL drivers have ColumnTypeName in all caps and postgres in small case
     */
    switch (metaData.getColumnTypeName(i).toLowerCase()) {
      case "date":
        outputTableRow.set(
            getColumnRef(metaData, i), dateFormatter.format(resultSet.getDate(i)));
        break;
      case "datetime":
        outputTableRow.set(
            getColumnRef(metaData, i),
            datetimeFormatter.format((TemporalAccessor) resultSet.getObject(i)));
        break;
      case "timestamp":
        outputTableRow.set(
            getColumnRef(metaData, i), timestampFormatter.format(resultSet.getTimestamp(i)));
        break;
      case "clob":
        Clob clobObject = resultSet.getClob(i);
        if (clobObject.length() > Integer.MAX_VALUE) {
          LOG.warn(
              "The Clob value size {} in column {} exceeds 2GB and will be truncated.",
              clobObject.length(),
              getColumnRef(metaData, i));
        }
        outputTableRow.set(
            getColumnRef(metaData, i), clobObject.getSubString(1, (int) clobObject.length()));
        break;
      default:
        outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i).toString());
    }
  }

  return outputTableRow;
}
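
For reference, here is a minimal sketch of how this mapper is wired between the JdbcIO source and the BigQueryIO sink. The driver, connection URL, query, and table spec below are placeholders, not my real values:

// org.apache.beam.sdk.Pipeline, org.apache.beam.sdk.io.jdbc.JdbcIO,
// org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO, TableRowJsonCoder}
Pipeline pipeline = Pipeline.create(options);

pipeline
    .apply("ReadFromJdbc",
        JdbcIO.<TableRow>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.cj.jdbc.Driver", "jdbc:mysql://host:3306/db")) // placeholder
            .withQuery("SELECT * FROM bookings")                          // placeholder
            .withCoder(TableRowJsonCoder.of())
            .withRowMapper(this::mapRow)) // the mapRow method shown above; enclosing class must be serializable
    .apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.bookings") // placeholder
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

pipeline.run();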

See the JdbcToBigQuery template source for more details.

Solutions I tried without success:

  • I tried to skip the column entirely when its value is null, but then it gives the error Missing required field.
  • I tried to hard-code the value as the string "null" in all cases so that I could handle it later, but that gives the error Could not convert value 'string_value: \t \"null\"' to integer.

How can I handle all the NULL cases? Please note, I cannot simply drop these rows, since their other columns do contain values.

To solve your issue, you have to pass null when the date value is null, and you have to set the associated BigQuery columns to NULLABLE:

public TableRow mapRow(ResultSet resultSet) throws Exception {
  ResultSetMetaData metaData = resultSet.getMetaData();
  TableRow outputTableRow = new TableRow();
  for (int i = 1; i <= metaData.getColumnCount(); i++) {
    if (resultSet.getObject(i) == null) {
      outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i));
      continue;
    }

    /*
     * DATE:      EPOCH MILLISECONDS -> yyyy-MM-dd
     * DATETIME:  EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSS
     * TIMESTAMP: EPOCH MILLISECONDS -> yyyy-MM-dd hh:mm:ss.SSSSSSXXX
     *
     * MySQL drivers have ColumnTypeName in all caps and postgres in small case
     */
    switch (metaData.getColumnTypeName(i).toLowerCase()) {
      case "date":
        // Format only when a value is present; otherwise keep null.
        String date = Optional.ofNullable(resultSet.getDate(i))
            .map(d -> dateFormatter.format(d))
            .orElse(null);
        outputTableRow.set(getColumnRef(metaData, i), date);
        break;
      case "datetime":
        String datetime = Optional.ofNullable(resultSet.getObject(i))
            .map(d -> datetimeFormatter.format((TemporalAccessor) d))
            .orElse(null);
        outputTableRow.set(getColumnRef(metaData, i), datetime);
        break;
      case "timestamp":
        String timestamp = Optional.ofNullable(resultSet.getTimestamp(i))
            .map(t -> timestampFormatter.format(t))
            .orElse(null);
        outputTableRow.set(getColumnRef(metaData, i), timestamp);
        break;
      case "clob":
        Clob clobObject = resultSet.getClob(i);
        if (clobObject.length() > Integer.MAX_VALUE) {
          LOG.warn(
              "The Clob value size {} in column {} exceeds 2GB and will be truncated.",
              clobObject.length(),
              getColumnRef(metaData, i));
        }
        outputTableRow.set(
            getColumnRef(metaData, i), clobObject.getSubString(1, (int) clobObject.length()));
        break;
      default:
        outputTableRow.set(getColumnRef(metaData, i), resultSet.getObject(i).toString());
    }
  }

  return outputTableRow;
}

For the date, datetime, and timestamp blocks, I applied the transformation only when the value is not null; otherwise I kept the default null value.
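
For the NULLABLE part, here is a minimal sketch of what the schema side could look like with the BigQuery model classes; the field names and types below merely mirror the sample table and are assumptions:

// com.google.api.services.bigquery.model.{TableSchema, TableFieldSchema}, java.util.Arrays
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("id").setType("INTEGER").setMode("NULLABLE"),
    new TableFieldSchema().setName("booking_date").setType("DATE").setMode("NULLABLE"),
    new TableFieldSchema().setName("status").setType("STRING").setMode("NULLABLE")));
// Note: if the table already exists, any column still marked REQUIRED must be
// relaxed to NULLABLE in the table definition itself before nulls will load.

With BigQueryIO you can pass this through .withSchema(schema) together with CREATE_IF_NEEDED; a pre-existing REQUIRED column will keep rejecting null values regardless of what the pipeline sends.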

