
BigQuery insert values AS, assume nulls for missing columns

Imagine there is a table with 1000 columns. I want to add a row with values for 20 columns and assume NULL for the rest.

INSERT VALUES syntax can be used for that:

INSERT INTO `tbl` (
  date, 
  p, 
  ... # 18 more names
)
VALUES(
  DATE('2020-02-01'), 
  'p3',
  ... # 18 more values
)

The problem with it is that it is hard to tell which value corresponds to which column. And if you need to change or comment out some value, then you have to make edits in two places.

INSERT SELECT syntax can also be used:

INSERT INTO `tbl`
SELECT 
  DATE('2020-02-01') AS date, 
  'p3' AS p,
  ... # 18 more value AS column
  ... # 980 more NULL AS column

Then if I need to comment out some column, just one line has to be commented out. But obviously having to write out 980 NULLs is an inconvenience.

What is the way to combine both approaches? To achieve something like:

INSERT INTO `tbl`
SELECT 
  DATE('2020-02-01') AS date, 
  'p3' AS p,
  ... # 18 more value AS column

The query above doesn't work; the error is: Inserted row has wrong column count; Has 20, expected 1000.

Your first version is really the only one you should ever be using for SQL inserts. It ensures that every target column is explicitly mentioned, and is unambiguous with regard to where the literals in the VALUES clause should go. You can use the version which does not explicitly mention column names; at first, it might seem that you are saving yourself some code. But realize that a column list is still being used: the list of all the table's columns, in whatever order they appear in the table definition. Your code might work, but appreciate that any addition or removal of a column, or change of column order, can totally break your insert script. For this reason, most will strongly advocate for the first version.

You can try the following solution; it is a combination of the two approaches you highlighted in your question:

INSERT INTO `tbl` (date, p, ...)  # 18 more column names
SELECT 
  DATE('2020-02-01') AS date, 
  'p3' AS p,
  ... # 18 more value AS column 

A couple of things you should consider here:

  1. The other 980 columns should be nullable, that is, able to hold NULL values.
  2. The columns in the INSERT column list and in the SELECT should be in the same order, so that the data is inserted in the correct order.
  3. To avoid any confusion, use aliases in the SELECT query that match the insert table's column names. That removes any ambiguity.

Hopefully it will work for you.
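If you generate the statement, the column list and the SELECT aliases can be kept in sync automatically from a single mapping, so each column/value pair is edited in one place. A minimal sketch (the helper name and the example columns are hypothetical, not from the question):

```python
def build_insert(table, values):
    """Build an INSERT ... SELECT statement from a column -> SQL expression mapping."""
    cols = ", ".join(values)  # explicit column list, same order as the SELECT
    select = ",\n  ".join(f"{expr} AS {col}" for col, expr in values.items())
    return f"INSERT INTO `{table}` ({cols})\nSELECT\n  {select}"

sql = build_insert("tbl", {
    "date": "DATE('2020-02-01')",
    "p": "'p3'",
    # ... 18 more column: expression pairs
})
print(sql)
```

Commenting out one dictionary entry now removes both the column name and its value at once.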

In BigQuery, the best way to do what you're describing is to first load to a staging table. I'll assume you can get the values you want to insert into JSON format, with keys that correspond to the target column names.

values.json

{"date": "2020-01-01", "p": "p3", "column": "value", ... }

Then generate a schema file for the target table and save it locally:

bq show --schema project:dataset.tbl > schema.json

Load the new data to the staging table using the target schema. This gives you "named" null values for each column present in the target schema but missing from your JSON, bypassing the need to write them out.

bq load --replace --source_format=NEWLINE_DELIMITED_JSON \
project:dataset.stg_tbl values.json schema.json
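If you want to see up front which columns the load will fill with NULL, the schema file can be compared against a row locally. A sketch, assuming schema.json has the shape produced by bq show --schema (inlined here as a string, with only three of the columns shown):

```python
import json

# Schema as produced by `bq show --schema` (truncated for illustration).
schema_json = '''
[
  {"name": "date", "type": "DATE", "mode": "NULLABLE"},
  {"name": "p", "type": "STRING", "mode": "NULLABLE"},
  {"name": "q", "type": "STRING", "mode": "NULLABLE"}
]
'''

row = {"date": "2020-02-01", "p": "p3"}  # the values you actually have

schema_cols = [field["name"] for field in json.loads(schema_json)]
missing = [c for c in schema_cols if c not in row]  # these load as NULL
print(missing)
```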

Now the insert select statement works every time:

insert into `project:dataset.tbl`
select * from `project:dataset.stg_tbl`

Not a pure SQL solution, but I managed this by loading my staging table with data and then running something like:

from google.cloud import bigquery

client = bigquery.Client()

table1 = client.get_table(f"{project_id}.{dataset_name}.table1")
table1_col_map = {field.name: field for field in table1.schema}

table2 = client.get_table(f"{project_id}.{dataset_name}.table2")
table2_col_map = {field.name: field for field in table2.schema}

# Later unpacking wins, so table1's definition is kept for shared columns.
combined_schema = {**table2_col_map, **table1_col_map}

table1.schema = list(combined_schema.values())

client.update_table(table1, ["schema"])  # push the widened schema to BigQuery

Explanation:

This retrieves the schemas of both tables and converts each into a dictionary with the column name as key and the actual field info from the SDK as value. The two maps are then combined with dictionary unpacking; the order of unpacking determines which table's columns take precedence when a column is common to both. Finally, the combined schema is assigned back to table 1 and used to update the table, adding the missing columns as nullable fields.
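The precedence rule described above comes straight from Python's dict unpacking: for shared keys, later unpacked dicts override earlier ones. A standalone illustration with plain dicts standing in for the column maps (the field values are placeholder strings, not real SDK objects):

```python
# table2's map is unpacked first, table1's last, so table1's definition
# of any common column (here "date") wins in the combined result.
table2_col_map = {"date": "DATE (table2)", "extra": "STRING (table2)"}
table1_col_map = {"date": "DATE (table1)", "p": "STRING (table1)"}

combined = {**table2_col_map, **table1_col_map}
print(combined)
# {'date': 'DATE (table1)', 'extra': 'STRING (table2)', 'p': 'STRING (table1)'}
```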
