BigQuery insert values AS, assume nulls for missing columns
Imagine there is a table with 1000 columns. I want to add a row with values for 20 columns and assume NULLs for the rest. The INSERT ... VALUES syntax can be used for that:
INSERT INTO `tbl` (
date,
p,
... # 18 more names
)
VALUES(
DATE('2020-02-01'),
'p3',
... # 18 more values
)
The problem with it is that it is hard to tell which value corresponds to which column. And if you need to change/comment out some value then you have to make edits in two places.

The INSERT ... SELECT syntax can also be used:
INSERT INTO `tbl`
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
... # 980 more NULL AS column
Then if I need to comment out some column, just one line has to be commented out. But obviously having to set 980 NULLs is an inconvenience.

What is the way to combine both approaches? To achieve something like:
INSERT INTO `tbl`
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
The query above doesn't work; the error is: Inserted row has wrong column count; Has 20, expected 1000.
Your first version is really the only one you should ever be using for SQL inserts. It ensures that every target column is explicitly mentioned, and is unambiguous with regard to where the literals in the VALUES clause should go.

You can use the version which does not explicitly mention column names. At first, it might seem that you are saving yourself some code. But realize that there is a column list which will be used, and it is the list of all the table's columns, in whatever their positions from the definition are. Your code might work, but appreciate that any addition/removal of a column, or changing of column order, can totally break your insert script. For this reason, most will strongly advocate for the first version.
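That breakage is easy to reproduce. A minimal sketch using Python's built-in sqlite3 (not BigQuery, but the column-count behavior is the same in any SQL engine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (date TEXT, p TEXT)")

# Implicit column list: valid only while the values line up
# with the current table definition.
con.execute("INSERT INTO tbl VALUES ('2020-02-01', 'p3')")

# Someone later adds a column; the very same statement now fails.
con.execute("ALTER TABLE tbl ADD COLUMN extra TEXT")
try:
    con.execute("INSERT INTO tbl VALUES ('2020-02-01', 'p3')")
except sqlite3.OperationalError as e:
    print(e)  # table tbl has 3 columns but 2 values were supplied

# The explicit form keeps working no matter how many columns are added.
con.execute("INSERT INTO tbl (date, p) VALUES ('2020-02-01', 'p3')")
```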
You can try the following solution; it is a combination of the two approaches you highlighted above:
INSERT INTO `tbl` (date, p, ...) # 18 more column names
SELECT
DATE('2020-02-01') AS date,
'p3' AS p,
... # 18 more value AS column
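To avoid typing each of the 20 names twice, the statement above can also be generated from a single dict of column-to-expression pairs. A sketch with a hypothetical helper (`build_insert_select` is not a BigQuery API; the table name and expressions come from the question):

```python
def build_insert_select(table, exprs):
    """Generate the INSERT INTO ... SELECT ... AS form from one mapping.

    Any column absent from `exprs` is simply omitted from the column
    list, so BigQuery fills it with NULL.
    """
    cols = ",\n  ".join(exprs)
    select = ",\n  ".join(f"{expr} AS {col}" for col, expr in exprs.items())
    return f"INSERT INTO `{table}` (\n  {cols}\n)\nSELECT\n  {select}"

print(build_insert_select("tbl", {"date": "DATE('2020-02-01')", "p": "'p3'"}))
```

Commenting out one key of the dict removes both the column name and its value together, so they can never drift out of alignment.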
Hopefully it will work for you.
In BigQuery, the best way to do what you're describing is to first load to a staging table. I'll assume you can get the values you want to insert into JSON format, with keys that correspond to the target column names.
values.json
{"date": "2020-01-01", "p": "p3", "column": "value", ... }
Then generate a schema file for the target table and save it locally:
bq show --schema project:dataset.tbl > schema.json
Load the new data to the staging table using the target schema. This gives you "named" null values for each column present in the target schema but missing from your JSON, bypassing the need to write them out:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON \
project:dataset.stg_tbl values.json schema.json
Now the insert-select statement works every time:
insert into `project:dataset.tbl`
select * from `project:dataset.stg_tbl`
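The point of the load step is that every schema column missing from the JSON arrives in the staging table as a named null. The effect can be sketched in plain Python (the schema and record below are illustrative stand-ins, not the real 1000-column table):

```python
def fill_missing(schema_fields, record):
    """Return a full row dict: schema columns missing from `record`
    become explicit None values, the way `bq load` with a schema file
    materializes them in the staging table."""
    return {name: record.get(name) for name in schema_fields}

schema = ["date", "p", "col3", "col4"]  # stand-in for the full schema
row = fill_missing(schema, {"date": "2020-01-01", "p": "p3"})
# -> {'date': '2020-01-01', 'p': 'p3', 'col3': None, 'col4': None}
```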
Not a pure SQL solution, but I managed this by loading my staging table with data and then running something like:
from google.cloud import bigquery

client = bigquery.Client()

# Fetch both tables and map column name -> schema field
table1 = client.get_table(f"{project_id}.{dataset_name}.table1")
table1_col_map = {field.name: field for field in table1.schema}
table2 = client.get_table(f"{project_id}.{dataset_name}.table2")
table2_col_map = {field.name: field for field in table2.schema}

# Merge the two schemas; table1's fields win for common column names
combined_schema = {**table2_col_map, **table1_col_map}

# Assign the combined schema back to table1 and push the update;
# the newly added columns are filled with NULLs
table1.schema = list(combined_schema.values())
client.update_table(table1, ["schema"])
Explanation:

This will retrieve the schemas of both tables and convert each schema into a dictionary, with the column name as the key and the actual field info from the SDK as the value. Then both are combined with dictionary unpacking (the order of unpacking determines which table's columns take precedence when a column is common between them). Finally, the combined schema is assigned back to table 1 and used to update the table, adding the missing columns with nulls.
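The precedence rule is just how dict unpacking works in Python: when the same key appears in both mappings, the one unpacked last wins. A small illustration (plain strings stand in for the SDK's SchemaField objects):

```python
table1_col_map = {"date": "table1 field", "p": "table1 field"}
table2_col_map = {"p": "table2 field", "extra": "table2 field"}

# table1's entries are unpacked last, so its version of "p" wins.
combined = {**table2_col_map, **table1_col_map}
print(combined)
# {'p': 'table1 field', 'extra': 'table2 field', 'date': 'table1 field'}
```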