
BigQuery JSON schema validation

Are there any tools that will validate a JSON string against a BigQuery schema? I'd like to load valid ones to BQ, and re-process invalid ones.

I know that you can validate against a standard JSON Schema using (e.g.) Python's jsonschema; is there something similar for BQ schemas?
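For reference, the kind of standard JSON Schema check meant here looks roughly like this (the schema below is purely illustrative and not BigQuery-specific):

import jsonschema

# Purely illustrative JSON Schema, nothing BigQuery-specific yet.
schema = {"type": "object", "properties": {"age": {"type": "integer"}}}

jsonschema.validate({"age": 42}, schema)    # passes silently
jsonschema.validate({"age": "42"}, schema)  # raises jsonschema.exceptions.ValidationError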


Re Pentium10's comment: I can imagine a number of ETL scenarios where data from several sources has to be assembled so that it matches a BQ schema. Currently I need two schemas for the data, a JSON Schema and a BQ schema; I validate against the JSON Schema and hope that this is enough to satisfy the BQ schema on submission.


Specifically: in this situation, I have JSON which has arrived from a JavaScript front end and been entered into BQ as a string. I want to process this field and add it to BQ as a table in its own right, so that I can search it.

The JSON (more or less) falls into two 'schemas', but it is poorly typed (i.e. numbers are treated as strings, lists of length 1 are strings rather than lists, and so on). I want a quick way to see whether a field would go into the table, and it seems a little silly that I have a BQ table schema but cannot validate against it; instead I must also create a JSON Schema for the idealised data and check against that.

If you re-express your schema in JSON Schema (http://json-schema.org/implementations.html), you should be able to use one of the tools listed there to do the validation.

I would suggest that you use your JSON schema as a JSON object in Python; with this you could try to validate the schema using BigQuery's library.

1 - Request the schema from a BigQuery table (this should then be implemented dynamically):

from google.cloud import bigquery

# Connect to the project and fetch the table whose schema we want to compare against.
client = bigquery.Client(project='your_project')
dataset_ref = client.dataset('your_dataset')
table_ref = dataset_ref.table('your_table_name')
table_helper = client.get_table(table_ref)

2 - Get the schema and format it as JSON; after that you should be able to compare the two schemas.

What you have now is a list containing SchemaField() objects:

your_schema = table_helper.schema
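
To see what that list contains, you can iterate over it; each entry exposes the attributes used below (a quick inspection sketch, not required for the comparison):

# Each entry is a google.cloud.bigquery.SchemaField with these attributes.
for field in your_schema:
    print(field.name, field.field_type, field.mode, field.description)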

You could try to format a list and then dump it into a JSON object...

import json

# Turn each SchemaField into a plain dict so the list can be dumped as JSON.
formatted_list_schema = [{"name": field.name, "type": field.field_type, "mode": field.mode,
                          "description": field.description} for field in table_helper.schema]
json_bq_schema = json.dumps(formatted_list_schema)
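
Alternatively, if your version of google-cloud-bigquery provides SchemaField.to_api_repr() (available in recent releases), it already returns the REST-style dict for each field, so the hand-formatting above can be skipped:

# Hedged alternative: let the client library serialize each field itself.
json_bq_schema = json.dumps([field.to_api_repr() for field in table_helper.schema])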

You could try to format that BQ JSON schema so that you can compare it the way they do here: How to compare two JSON objects with the same elements in a different order equal?
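
A minimal sketch of that order-insensitive comparison, assuming both sides have been parsed into Python lists of field dicts (other_schema_fields is a hypothetical placeholder for the schema you are comparing against):

def canonical(fields):
    # Serialize every field dict with sorted keys, then sort the list itself,
    # so neither key order nor field order affects the comparison.
    return sorted(json.dumps(field, sort_keys=True) for field in fields)

schemas_match = canonical(json.loads(json_bq_schema)) == canonical(other_schema_fields)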

I know that this is not an easy solution to implement, but I guess if you tweak it well enough, it will be robust and can solve your problem. Feel free to ask if I can help you more...

For more info about schemas, check https://cloud.google.com/bigquery/docs/schemas

It's hard to answer without any examples provided, but generally you can use jsonschema for that.

Here's the metaschema definition in YAML:

"$schema": http://json-schema.org/draft-07/schema

title: Metaschema for BigQuery fields definition schemas
description: "See also: https://cloud.google.com/bigquery/docs/schemas"

type: array
minItems: 1
uniqueItems: yes

items:
  "$id": "#/items"
  title: Single field definition schema
  type: object

  examples:

  - name: Item_Name
    type: STRING
    mode: NULLABLE
    description: Name of catalog item

  - name: Item_Category
    type: STRING
    mode: REQUIRED

  - name: Exchange_Rate
    type: NUMERIC

  additionalProperties: no
  required:
  - name
  - type

  properties:

    name:
      "$id": "#/items/properties/name"
      title: Name of field
      description: "See also: https://cloud.google.com/bigquery/docs/schemas#column_names"
      type: string
      minLength: 1
      maxLength: 128
      pattern: "^[a-zA-Z_]+[a-zA-Z0-9_]*$"
      examples:
      - Item_Name
      - Exchange_Rate

    description:
      "$id": "#/items/properties/description"
      title: Description of field
      description: "See also: https://cloud.google.com/bigquery/docs/schemas#column_descriptions"          
      type: string
      maxLength: 1024

    type:
      "$id": "#/items/properties/type"
      title: Name of BigQuery data type
      description: 'See also: https://cloud.google.com/bigquery/docs/schemas#standard_sql_data_types'
      type: string
      enum:
      - INTEGER
      - FLOAT
      - NUMERIC
      - BOOL
      - STRING
      - BYTES
      - DATE
      - DATETIME
      - TIME
      - TIMESTAMP
      - GEOGRAPHY

    mode:
      "$id": "#/items/properties/mode"
      title: Mode of field
      description: 'See also: https://cloud.google.com/bigquery/docs/schemas#modes'
      type: string
      default: NULLABLE
      enum:
      - NULLABLE
      - REQUIRED
      - REPEATED

This is the most precise metaschema I've been able to generate from the GCP docs. Structures and arrays are not supported here, though; see the sketch at the end of this answer for one possible extension.

YAML is just for readability here; you can easily convert it into JSON if needed.
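
For example, a one-off conversion could look like this (using the same hypothetical "/path/to/metaschema.yaml" path as in the usage example below):

import json
from pathlib import Path

import yaml

# Load the YAML metaschema and write it back out as JSON.
metaschema = yaml.safe_load(Path("/path/to/metaschema.yaml").read_text())
Path("/path/to/metaschema.json").write_text(json.dumps(metaschema, indent=2))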

Assuming the metaschema from above is saved as "/path/to/metaschema.yaml", the usage is the following:

import json

from pathlib import Path

import jsonschema
import yaml


metaschema = yaml.safe_load(Path("/path/to/metaschema.yaml").read_text())

schema = """[{"name": "foo", "type": "STRING"}]"""
schema = json.loads(schema)


jsonschema.validate(schema, metaschema)

The yaml module used above is provided by the PyYAML package.

If the schema is valid, the jsonschema.validate() function will simply pass. Otherwise, jsonschema.exceptions.ValidationError will be raised with an explanation of the error.
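
If you prefer a report to an unhandled exception, a small wrapper along these lines works (exc.message and exc.absolute_path are standard attributes of jsonschema's ValidationError):

try:
    jsonschema.validate(schema, metaschema)
    print("BigQuery schema is valid")
except jsonschema.exceptions.ValidationError as exc:
    # absolute_path points at the offending element, message explains the failure.
    print(f"Invalid BigQuery schema at {list(exc.absolute_path)}: {exc.message}")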

It's up to you whether to use JSON or YAML and how to store and parse schemas.

It's also up to you whether to convert the names of types and modes to upper- or lowercase.
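
As noted above, RECORD/STRUCT fields are not covered by the metaschema. One possible, untested extension is to allow a nested fields array that is validated recursively against the whole document, for example by patching the loaded metaschema in Python:

# Hypothetical extension: accept RECORD/STRUCT fields whose nested "fields"
# array is validated against the same metaschema via a recursive "$ref".
field_props = metaschema["items"]["properties"]
field_props["type"]["enum"] += ["RECORD", "STRUCT"]
field_props["fields"] = {"$ref": "#"}  # must be declared because of additionalProperties: no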

This is one of the implementations that I created: https://github.com/toshi0607/bq-schema-validator

It's a bit fuzzy, but it usually detects the error-prone field in a JSON log.
