
Insert data in BigQuery from Dataflow

Previously, I had a PCollection<TableRow> formattedResults and was using the code below to insert rows into BigQuery:

    // OPTION 1
    PCollection<TableRow> formattedResults = ....
    formattedResults.apply(BigQueryIO.Write.named("Write").to(tableName)
        .withSchema(tableSchema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

All rows were inserted directly into BigQuery; all was well up to here. But now I have started to identify the table name and its rows dynamically, so I am creating a PCollection as below (the String will be the table name, with its row as the value):

    // OPTION 2
    PCollection<KV<String, TableRow>> tableRowMap

I am also creating groups of rows that will go into the same table:

    // OPTION 3
    PCollection<KV<String, Iterable<TableRow>>> groupedRows

where the key (String) is the BigQuery table name and the value is the list of rows to be inserted into that table.

With OPTION 1 I could easily insert rows into BigQuery using the code shown above, but the same code cannot be used with OPTION 2 or OPTION 3 because in those cases the table name is the key of the map. Is there a way to insert rows into a table using OPTION 2 or OPTION 3? Any link or code sample would be a great help.

The closest thing Dataflow supports is writing to one table per window (and you can create your own BoundedWindow subclass and WindowFn to include whatever data you want in the window). To do this, use

to(SerializableFunction<BoundedWindow,String> tableSpecFunction)

on BigQueryIO.Write.
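Put together, a sketch of this approach against the old Google Cloud Dataflow SDK (1.x) might look like the following. Note that DestinationWindow and its getTableName() accessor are hypothetical stand-ins for a custom BoundedWindow subclass (assigned by a matching custom WindowFn) that carries the destination table name; this is untested illustration, not a drop-in implementation:

```java
// Assumes a custom WindowFn has already assigned each element to a
// DestinationWindow (hypothetical BoundedWindow subclass) that knows
// which BigQuery table its contents belong to.
formattedResults.apply(BigQueryIO.Write.named("WritePerWindow")
    .to(new SerializableFunction<BoundedWindow, String>() {
      @Override
      public String apply(BoundedWindow window) {
        // Derive the full table spec ("project:dataset.table") from the
        // window; the cast and getTableName() are assumptions.
        return "projectfoo:datasetfoo."
            + ((DestinationWindow) window).getTableName();
      }
    })
    .withSchema(tableSchema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```

The key idea is that the table name must be recoverable from the window alone, which is why the data you need for routing has to be baked into the custom window class.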

Note that this functionality uses BigQuery's streaming upload feature, which is limited to 100 MB/s per table. Additionally, uploads are not atomic, so a failed batch job may upload only part of the output.

You also have the option of creating your own DoFn that inserts data directly into BigQuery, instead of relying on BigQueryIO.Write. Technically, you need to create a BigQueryTableInserter; you can then use insertAll(TableReference ref, List<TableRow> rowList) to insert rows into your desired table.

You can create a TableReference using something like: new TableReference().setProjectId("projectfoo").setDatasetId("datasetfoo").setTableId("tablefoo")
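Applied to the grouped PCollection from OPTION 3, a minimal, untested sketch of such a DoFn against the Dataflow SDK 1.x could look like this; the "projectfoo"/"datasetfoo" ids are placeholders, and the way the Bigquery client is obtained (via the SDK's Transport utility) is an assumption about that SDK version:

```java
// Sketch: writes each key's rows directly to the table named by that key.
// Depends on the (deprecated) Google Cloud Dataflow SDK 1.x.
class WriteToDynamicTable extends DoFn<KV<String, Iterable<TableRow>>, Void> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    String tableName = c.element().getKey();
    List<TableRow> rows = new ArrayList<>();
    for (TableRow row : c.element().getValue()) {
      rows.add(row);
    }
    TableReference ref = new TableReference()
        .setProjectId("projectfoo")   // placeholder project id
        .setDatasetId("datasetfoo")   // placeholder dataset id
        .setTableId(tableName);       // table name comes from the key
    // Build a Bigquery client from the pipeline options (assumed API).
    Bigquery client = Transport.newBigQueryClient(
        c.getPipelineOptions().as(BigQueryOptions.class)).build();
    new BigQueryTableInserter(client).insertAll(ref, rows);
  }
}
```

You would apply it with something like groupedRows.apply(ParDo.of(new WriteToDynamicTable())), keeping in mind the caveat below.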

This isn't 100% recommended, because BigQueryIO does some nice work splitting up the rows that need inserting to maximize throughput, and it handles retries properly.
