
Apache Beam / Dataflow BigQuery IO without a schema

Is there any way to write unstructured data to a BigQuery table using the Apache Beam / Dataflow BigQuery IO API (i.e., without providing a schema up front)?

BigQuery needs to know the schema when it creates the table, or when you write to it. Depending on your situation, you may be able to determine the schema dynamically in the pipeline construction code rather than hard-coding it.
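As a rough illustration of determining the schema dynamically, the sketch below derives a BigQuery-style JSON schema (`fields` entries with `name`, `type`, `mode`) from one sample record. It is written in JavaScript only to match the UDF shown later; real pipeline construction code would do the same thing in Java or Python. The helper names `inferField`, `inferFields` and `inferSchema` are hypothetical, and the sketch assumes the sample is a JSON object that is representative of all records.

```javascript
// Hypothetical helper: maps one JavaScript value to a BigQuery field definition.
function inferField(name, value) {
  if (Array.isArray(value)) {
    // Arrays become REPEATED fields; the element type comes from the first item.
    // Assumption: an empty array is treated as a repeated STRING.
    const element = inferField(name, value.length > 0 ? value[0] : '');
    return { ...element, mode: 'REPEATED' };
  }
  if (value !== null && typeof value === 'object') {
    // Nested objects become RECORD fields with their own sub-fields.
    return { name, type: 'RECORD', mode: 'NULLABLE', fields: inferFields(value) };
  }
  if (typeof value === 'boolean') {
    return { name, type: 'BOOLEAN', mode: 'NULLABLE' };
  }
  if (typeof value === 'number') {
    return { name, type: Number.isInteger(value) ? 'INTEGER' : 'FLOAT', mode: 'NULLABLE' };
  }
  // Strings, nulls and anything unrecognised fall back to STRING.
  return { name, type: 'STRING', mode: 'NULLABLE' };
}

// Hypothetical helper: one field definition per key of the record.
function inferFields(record) {
  return Object.keys(record).map((key) => inferField(key, record[key]));
}

// Hypothetical entry point: takes a sample JSON object as text.
function inferSchema(sampleJson) {
  return { fields: inferFields(JSON.parse(sampleJson)) };
}

const schema = inferSchema('{"id": 7, "name": "a", "tags": ["x"], "geo": {"lat": 1.5}}');
console.log(JSON.stringify(schema, null, 2));
```

A schema guessed from a single record is fragile (a field that happens to be `7` in the sample may be `7.5` later), so this only works when the sample really is representative. Otherwise the single STRING column approach below is the safer choice.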

  1. Create a table with just a single STRING column to store the data from Dataflow.
CREATE TABLE IF NOT EXISTS `your_project.dataset.rawdata` (
  raw STRING
);

You can store any data as a string without knowing its schema. For example, you can store a JSON document as a single string, a CSV line as a string, and so on.

  2. Specify the table as the destination of your Dataflow job. You may need to provide Dataflow with a JavaScript UDF that converts each message from the source into a single string compatible with the schema of the table above.
/**
 * User-defined function (UDF) to transform events
 * as part of a Dataflow template job.
 *
 * @param {string} inJson input Pub/Sub JSON message (stringified)
 * @return {string} outJson output JSON message (stringified)
 */
function process(inJson) {
  var obj = JSON.parse(inJson),
      includePubsubMessage = obj.data && obj.attributes,
      data = includePubsubMessage ? obj.data : obj;
  
  // INSERT CUSTOM TRANSFORMATION LOGIC HERE

  return JSON.stringify(obj);
}

As you can see, the sample UDF above returns a JSON string.
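For the single-column table from step 1, the returned JSON presumably needs a top-level `raw` key, because the template matches JSON keys to column names. A minimal sketch of such a UDF is below. The name `wrapAsRaw` is hypothetical (the template calls whichever function name you configure), and the sketch assumes the incoming message should be stored completely unmodified.

```javascript
/**
 * Hypothetical UDF variant: stores the whole incoming message, unparsed,
 * in the `raw` column of the table created in step 1.
 *
 * @param {string} inJson input message as text (JSON, CSV, or anything else)
 * @return {string} JSON object with a single `raw` key (stringified)
 */
function wrapAsRaw(inJson) {
  // JSON.stringify escapes quotes and newlines inside the input,
  // so the original text can be recovered exactly from the column.
  return JSON.stringify({ raw: inJson });
}

// Example call with a JSON message; a CSV line would work the same way.
console.log(wrapAsRaw('{"user": "a", "clicks": 3}'));
```

Because the input is never parsed, malformed JSON does not make the UDF throw. Validation is deferred to query time.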

  3. You can later interpret the data with a schema (a schema-on-read strategy), for example:
SELECT JSON_VALUE(raw, '$.json_path_you_have') AS column1,
       JSON_QUERY_ARRAY(raw, '$.json_path_you_have') AS column2,
       ...
  FROM `your_project.dataset.rawdata`

Depending on your source data, you can use JSON functions or regular expressions to organize the data into a table with the schema you want.


 