
Insert data in BigQuery from Dataflow

Previously, I had a PCollection<TableRow> formattedResults and was using the code below to insert rows into BigQuery:

    // OPTION 1
    PCollection<TableRow> formattedResults = ....
    formattedResults.apply(BigQueryIO.Write.named("Write").to(tableName)
        .withSchema(tableSchema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

All rows were inserted directly into BigQuery; all was well up to here. But now I have started to identify the table name and its rows dynamically, so I am creating a PCollection as below (the String will be the table name, with its row as the value):

    // OPTION 2
    PCollection<KV<String, TableRow>> tableRowMap

I am also creating groups of rows that will go into the same table:

    // OPTION 3
    PCollection<KV<String, Iterable<TableRow>>> groupedRows

where the key (String) is the BigQuery table name and the value is the list of rows to be inserted into that table.

With OPTION 1 I could easily insert rows into BigQuery using the code shown above, but the same code cannot be used with OPTION 2 or OPTION 3 because in those cases the table name is the key of the map. Is there a way to insert rows into a table using OPTION 2 or OPTION 3? Any link or code sample would be a great help.

The closest thing Dataflow supports is writing to one table per window (and you can create your own BoundedWindow subclass and WindowFn to include whatever data you want in the window). To do this, use

to(SerializableFunction<BoundedWindow,String> tableSpecFunction)

on BigQueryIO.Write.
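Put together, a sketch of this approach against the old Google Cloud Dataflow SDK (1.x) might look like the following. Note that DestinationWindow and its getTableName() accessor are hypothetical stand-ins for a custom BoundedWindow subclass (assigned by a matching custom WindowFn) that carries the destination table name; this is untested illustration, not a drop-in implementation:

```java
// Assumes a custom WindowFn has already assigned each element to a
// DestinationWindow (hypothetical BoundedWindow subclass) that knows
// which BigQuery table its contents belong to.
formattedResults.apply(BigQueryIO.Write.named("WritePerWindow")
    .to(new SerializableFunction<BoundedWindow, String>() {
      @Override
      public String apply(BoundedWindow window) {
        // Derive the full table spec ("project:dataset.table") from the
        // window; the cast and getTableName() are assumptions.
        return "projectfoo:datasetfoo."
            + ((DestinationWindow) window).getTableName();
      }
    })
    .withSchema(tableSchema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```

The key idea is that the table name must be recoverable from the window alone, which is why the data you need for routing has to be baked into the custom window class.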

Note that this functionality uses BigQuery's streaming upload feature, which is limited to 100 MB/s per table. Additionally, uploads are not atomic, so a failed batch job may upload only part of the output.

You also have the option of creating your own DoFn that inserts data directly into BigQuery, instead of relying on BigQueryIO.Write. Technically, you need to create a BigQueryTableInserter; you can then use insertAll(TableReference ref, List<TableRow> rowList) to insert rows into your desired table.

You can create a TableReference using something like: new TableReference().setProjectId("projectfoo").setDatasetId("datasetfoo").setTableId("tablefoo")
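Applied to the grouped PCollection from OPTION 3, a minimal, untested sketch of such a DoFn against the Dataflow SDK 1.x could look like this; the "projectfoo"/"datasetfoo" ids are placeholders, and the way the Bigquery client is obtained (via the SDK's Transport utility) is an assumption about that SDK version:

```java
// Sketch: writes each key's rows directly to the table named by that key.
// Depends on the (deprecated) Google Cloud Dataflow SDK 1.x.
class WriteToDynamicTable extends DoFn<KV<String, Iterable<TableRow>>, Void> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    String tableName = c.element().getKey();
    List<TableRow> rows = new ArrayList<>();
    for (TableRow row : c.element().getValue()) {
      rows.add(row);
    }
    TableReference ref = new TableReference()
        .setProjectId("projectfoo")   // placeholder project id
        .setDatasetId("datasetfoo")   // placeholder dataset id
        .setTableId(tableName);       // table name comes from the key
    // Build a Bigquery client from the pipeline options (assumed API).
    Bigquery client = Transport.newBigQueryClient(
        c.getPipelineOptions().as(BigQueryOptions.class)).build();
    new BigQueryTableInserter(client).insertAll(ref, rows);
  }
}
```

You would apply it with something like groupedRows.apply(ParDo.of(new WriteToDynamicTable())), keeping in mind the caveat below.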

This isn't 100% recommended, because BigQueryIO does some nice work splitting up the rows that need inserting to maximize throughput, and it handles retries properly.
