Apache Beam Dataflow BigQuery IO without a schema

Question

Is there any way to write unstructured data to a BigQuery table using the Apache Beam (Dataflow) BigQueryIO API, i.e. without providing a schema up front?

Answer 1

BigQuery needs to know the schema when it creates the table or when you write to it. Depending on your situation, you may be able to determine the schema dynamically in the pipeline construction code rather than hard-coding it.
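As a rough illustration of determining the schema dynamically, the sketch below infers a BigQuery-style field list from one sample JSON record. It is plain JavaScript rather than Beam pipeline code, and `inferSchema`, `inferFields`, and `bigQueryType` are hypothetical helper names, not part of any Beam or BigQuery API. The type mapping is a simplifying assumption (for example, every non-integer number becomes FLOAT, and null values default to STRING).

```javascript
// Hypothetical helper: map one JavaScript value to a BigQuery column type.
function bigQueryType(value) {
  if (typeof value === "boolean") return "BOOLEAN";
  if (typeof value === "number") {
    return Number.isInteger(value) ? "INTEGER" : "FLOAT";
  }
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    return "RECORD";
  }
  // Strings, nulls, missing values, and nested arrays fall back to STRING.
  return "STRING";
}

// Hypothetical helper: build the field list for one parsed sample object.
function inferFields(sample) {
  return Object.keys(sample).map(function (name) {
    var value = sample[name];
    var repeated = Array.isArray(value);
    // For arrays, look at the first element only (an assumption).
    var probe = repeated ? value[0] : value;
    var field = {
      name: name,
      type: bigQueryType(probe),
      mode: repeated ? "REPEATED" : "NULLABLE"
    };
    if (field.type === "RECORD") {
      field.fields = inferFields(probe);
    }
    return field;
  });
}

// Entry point: takes a JSON string, returns an object shaped like a
// BigQuery JSON table schema ({fields: [{name, type, mode}, ...]}).
function inferSchema(sampleJson) {
  return { fields: inferFields(JSON.parse(sampleJson)) };
}
```

A single sample may not be representative of the whole input, so a schema inferred this way is best treated as a starting point and reviewed before it is used for a real table.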

Answer 2

Create a table with just a single STRING column to store the data coming from Dataflow.

CREATE TABLE IF NOT EXISTS `your_project.dataset.rawdata` (
  raw STRING
);

You can store any data as a string without knowing its schema. For example, you can store a JSON document as a single string, a CSV line as a string, and so on.

Specify the table as the destination of your Dataflow job. You may need to provide Dataflow with a JavaScript UDF that converts each message from the source into a single string compatible with the schema of the table above.

/**
 * User-defined function (UDF) to transform events
 * as part of a Dataflow template job.
 *
 * @param {string} inJson input Pub/Sub JSON message (stringified)
 * @return {string} outJson output JSON message (stringified)
 */
function process(inJson) {
  var obj = JSON.parse(inJson),
      includePubsubMessage = obj.data && obj.attributes,
      data = includePubsubMessage ? obj.data : obj;
  
  // INSERT CUSTOM TRANSFORMATION LOGIC HERE

  return JSON.stringify(obj);
}

https://cloud.google.com/blog/topics/developers-practitioners/extend-your-dataflow-template-with-udfs

As you can see, the sample UDF above returns a JSON string.
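Note that the single-column table expects each row to carry one `raw` field. The following is a minimal sketch of a UDF that wraps the whole incoming message that way. It assumes the template maps the keys of the returned JSON object to column names, and `toRawRow` is a hypothetical function name (use whatever name you configure for the template).

```javascript
/**
 * Sketch of a UDF that stores the entire message in one "raw" column.
 *
 * @param {string} inMessage input message as a string (JSON, CSV, or other)
 * @return {string} stringified JSON object with a single "raw" key
 */
function toRawRow(inMessage) {
  var payload;
  try {
    // If the message is valid JSON, re-serialize it to a compact form.
    payload = JSON.stringify(JSON.parse(inMessage));
  } catch (e) {
    // Otherwise keep the text unchanged (for example, a CSV line).
    payload = inMessage;
  }
  return JSON.stringify({ raw: payload });
}
```

Because the payload is kept as an opaque string, the same function works for any input format, and all interpretation is deferred to query time.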

You can later interpret the data with a schema (a schema-on-read strategy), like the following:

SELECT JSON_VALUE(raw, '$.json_path_you_have') AS column1,
       JSON_QUERY_ARRAY(raw, '$.json_path_you_have') AS column2,
       ...
  FROM `your_project.dataset.rawdata`

Depending on your source data, you can use JSON functions or regular expressions to organize the data into a table with the schema you want.

Answer 1
1 2022-05-06 15:43:52

Answer 2
0 2022-05-06 16:24:35
