Snowflake query performance on unique element in variant column

I am querying a Snowflake view that contains many TB of semi-structured JSON data. When I query the variant column of interest for an element that is not unique among the records, results are returned within seconds:

SELECT json_data:element1 FROM table WHERE json_data:common_category = 'CATEGORY1';

When I query the variant column of interest for an element that is unique among the records, the query runs for an unacceptably long time that I have yet to see complete:

SELECT json_data:element1 FROM table WHERE json_data:unique_id = 'ID123456';

I believe that flattening the unique element into relational form outside of the variant column would improve performance, but I am not a DBA with those permissions. Is there some way to tune my query so that looking up a single record based on the variant column's JSON data yields acceptable performance?

Snowflake internally stores variant (JSON) data in independent, column-like structures for the 100+ most common elements, with the remainder in a leftovers-like column. Those virtual columns have min/max and distribution stats, just like normal columns do.

notes 1, notes 2

This means that on the major columns of your data, Snowflake can prune lots of unneeded partitions from the read (if your data is naturally ordered in a way that helps this).

It also means that if you are using only a couple of columns from the JSON, it reads just those stripes, and thus does less IO.

Also, when you select the whole blob as you do here, the second point does not come into play, because the READ for the SELECT and the READ for the WHERE are the same.

So for your queries: you will see that the first query reads only a small number of partitions, while the second query plans to read all partitions.
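
If you want to see that for yourself, one sketch of how to compare the pruning of the two queries (assuming you have access to the ACCOUNT_USAGE share; the view and column names here are how I believe the stats are exposed, and the query profile in the web UI shows the same partition counts per scan node):

-- Compare partitions/bytes scanned for recent queries against the variant column.
SELECT query_text, partitions_scanned, partitions_total, bytes_scanned
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%json_data:%'
ORDER BY start_time DESC
LIMIT 10;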

If you alter your first query to:

SELECT json_data:common_category FROM table WHERE json_data:common_category = 'CATEGORY1';

You will see that the number of partitions read is the same as in the first example, but the number of bytes read should be a fraction of it.

Again, as with normal tables, you should always name all your columns and avoid SELECT * FROM TABLE, so the plan knows exactly what to pull. You will see statistically faster compile times when you name all your first-order columns and all your variant columns.
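
As a minimal sketch of that (record_id is a hypothetical first-order column, and table is the placeholder table name from the question):

-- Name the top-level columns and the variant paths you actually need, rather
-- than SELECT *, so only those column stripes have to be read.
SELECT record_id,
       json_data:common_category::string AS common_category,
       json_data:element1::string AS element1
FROM table
WHERE json_data:common_category = 'CATEGORY1';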

In the context of making it faster:

If you have to have all the JSON columns, and

SELECT json_data FROM table WHERE json_data:common_category = 'CATEGORY1';

runs at an acceptable speed, then do:

SELECT json_data:common_category FROM table WHERE json_data:unique_id = 'ID123456';
SELECT json_data FROM table WHERE json_data:common_category = <answer from prior> and json_data:unique_id = 'ID123456';

This way the first query reads the smallest amount from all partitions, and the second reads everything it needs from only the partitions that have to be read.

Now, this will not always work, for example if the common_category for unique_id = 'ID123456' is common to all partitions. But if every row has some other column that is sequential or aligned with the sort order of the data (be that the write order from how you ingest the data, or how you order the data if you have it clustered), then first select the filter column and the ordering column, and then select the full match, using the focusing effect of the ordering column.
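
A minimal sketch of that two-step pattern, assuming a hypothetical insert_ts column that is aligned with the write/sort order of the data:

-- Step 1: a cheap probe that reads only thin stripes across the partitions,
-- returning the ordering-column value for the row of interest.
SELECT insert_ts FROM table WHERE json_data:unique_id = 'ID123456';

-- Step 2: filter on the ordering column so partition pruning kicks in, then
-- re-apply the unique filter to pick out the exact row and read the full blob.
SELECT json_data FROM table WHERE insert_ts = <answer from prior> AND json_data:unique_id = 'ID123456';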

We have very similar audit data that the above pattern is used on, and other data that we store across multiple tables: some of the tables are super skinny and ordered (via cluster keys), and we have an insert_time key shared by both that fast table and a wide/fat JSON table holding all the "extras" that are rarely used but are written in insert_time order. Finding the desired data in the fast table then lets us read the wide table with far fewer partitions.
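
A rough sketch of that layout, with made-up table names (audit_keys for the skinny clustered table, audit_wide for the wide JSON table) standing in for ours:

-- The skinny table is clustered/ordered, so this lookup prunes well and is cheap.
SELECT insert_time FROM audit_keys WHERE unique_id = 'ID123456';

-- The wide table is written in insert_time order, so this predicate prunes it
-- down to a few partitions before the JSON filter is applied.
SELECT json_data FROM audit_wide WHERE insert_time = <answer from prior> AND json_data:unique_id = 'ID123456';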
