Table partitioning in Azure SQL DW based on Cust_Id / Tenant_Id (multi-tenant)?

Question

I am new to the MS BI stack and am trying to partition a SQL DW table on a field, cust_id, which is unique for each customer (tenant_id) and present in every table, so that each customer's data stays in its own partition without impacting the others.

Below are the table structure, the data, and the output from the partition query:

create table emp
(
cust_id integer not null,
emp_id varchar(5) not null,
emp_name varchar(10) not null
)
with 
(
  clustered columnstore index
 ,distribution = round_robin
 ,partition (cust_id range right for values (100,200,300) )
)

create table dept
(
cust_id integer not null,
dept_id varchar(5) not null,
emp_id varchar(5) not null,
dep_name varchar(10) not null
)
with 
(
  clustered columnstore index
 ,distribution = round_robin
 ,partition (cust_id range right for values (100,200,300) )
)

create statistics emp_stats on dbo.emp(cust_id)
create statistics dept_stats on dbo.dept(cust_id)

emp table:
cust_id emp_id  emp_name
101 EMP01   XYZ
101 EMP02   ABC
101 EMP03   DEF
201 EE001   JACK
201 EE002   MIKE

dept table:
cust_id dept_id emp_id  dep_name
101 D0001   EMP01   IT
101 D0001   EMP02   IT
201 DEP01   EE001   ENG
201 DEP02   EE002   HR


 SELECT sch.name AS [schema_name],
       tbl.[name] AS [table_name],
       ds.type_desc,
       prt.[partition_number],
       rng.[value] AS [current_partition_range_boundary_value],
       prt.[rows] AS [partition_rows]    
FROM   sys.schemas sch
       INNER JOIN sys.tables tbl    ON  sch.schema_id       = tbl.schema_id
       INNER JOIN sys.partitions prt    ON  prt.[object_id]     = tbl.[object_id]
       INNER JOIN sys.indexes idx   ON  prt.[object_id]     = idx.[object_id] AND prt.[index_id] = idx.[index_id]
       INNER JOIN sys.data_spaces               ds  ON  idx.[data_space_id] = ds.[data_space_id]                       
       INNER JOIN sys.partition_schemes     ps  ON  ds.[data_space_id]  = ps.[data_space_id]                
       INNER JOIN sys.partition_functions       pf  ON  ps.[function_id]    = pf.[function_id]              
       LEFT JOIN sys.partition_range_values rng ON  pf.[function_id]    = rng.[function_id] AND rng.[boundary_id] = prt.[partition_number]    
WHERE      tbl.name in ('emp','dept')
order by table_name, partition_number


schema_name table_name  type_desc   partition_number    current_partition_range_boundary_value  partition_rows
dbo dept    PARTITION_SCHEME    1   100 1
dbo dept    PARTITION_SCHEME    2   200 1
dbo dept    PARTITION_SCHEME    3   300 1
dbo dept    PARTITION_SCHEME    4   NULL    1
dbo emp PARTITION_SCHEME    1   100 1
dbo emp PARTITION_SCHEME    2   200 1
dbo emp PARTITION_SCHEME    3   300 1
dbo emp PARTITION_SCHEME    4   NULL    2

Questions/Clarifications:

1)  Is partitioning on the cust_id (tenant_id) field, combined with the round_robin distribution method, correct? What is the right way to do it? We need to segregate customer-specific data for both performance (load + query) and security reasons.
2)  How can we load a specific customer's data into its respective partition (cust_id)? What is the syntax in SQL DW?
insert into emp (partition = <partition_name_number> ) ?
3)  How do I verify that data is being loaded into the correct partition? I cannot understand the output of the query above: why does it show 4 partitions, and only 1 row for cust_id 101 in the emp table when there are actually 3? I expected that since 101 is between 100 and 200 it would be in partition_number = 1, and that 201, which is between 200 and 300, would be in partition_number = 2. Is this a wrong assumption about how partition ranges work? Can't we simply create a list partition in SQL DW for each cust_id?
4)  As per the MS documentation, each table is divided by default into 60 distributed databases, and there should be at least 1M rows per distribution for a partition. When you don't know how many customers, or what data volume, you may have in the future, how should we approach this?
5)  When creating a semantic layer (Analysis Services, SSAS) on top of the DW, is it helpful to partition further on tenant_id or some other field?

Thanks for your help, inputs and suggestions!

Answer 1

Partitioning is available on Azure SQL Data Warehouse, but the documentation comes with a warning that it may be counter-productive. Your example may well be such a case.
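As a hedged aside on the partition layout in the question: with RANGE RIGHT FOR VALUES (100, 200, 300), each boundary value belongs to the partition on its right, so there are four partitions: cust_id below 100, 100 to 199, 200 to 299, and 300 or above. Under that rule cust_id 101 lands in partition 2 and 201 in partition 3. The sketch below is one way to check the actual row counts per partition, assuming the dbo.emp and dbo.dept tables from the question exist. The output lists every partition once per distribution, which also shows how thinly a small tenant's rows get spread.

```sql
-- Sketch only: assumes dbo.emp and dbo.dept were created as in the question.
-- Reports rows and space used for each partition in each distribution.
DBCC PDW_SHOWPARTITIONSTATS ("dbo.emp");
DBCC PDW_SHOWPARTITIONSTATS ("dbo.dept");
```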

You should look to distribution rather than partitioning to make sure that your data is optimally aligned across the nodes.

Don't use round_robin distribution unless there is a strong reason that you can't use hash or replication. Round-robin will load fast, but subsequent queries on that table will be slow.
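A minimal sketch of a hash-distributed version of the emp table, for illustration only. The table name dbo.emp_hash and the choice of emp_id as the distribution column are assumptions, not something stated in the question; a good distribution column generally has many distinct values and is used in joins, so a cust_id with only a handful of tenants could skew the data.

```sql
-- Sketch only: dbo.emp_hash and HASH(emp_id) are illustrative assumptions.
CREATE TABLE dbo.emp_hash
(
    cust_id  integer     NOT NULL,
    emp_id   varchar(5)  NOT NULL,
    emp_name varchar(10) NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(emp_id)   -- rows with the same emp_id go to the same distribution
);
```

Hashing dept on the same column would let joins on emp_id run without moving data between distributions, again assuming emp_id is the usual join key.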

Are you implementing a dimensional model? How many customers do you have? The general guidance is to REPLICATE dimensions unless they are very large (1B+ rows), in which case they may drive the hash distribution strategy.
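If dept turns out to be a small dimension, a replicated copy might look like the sketch below. Treating dept as a dimension and the name dbo.dept_replicated are assumptions made for illustration.

```sql
-- Sketch only: assumes dept is a small dimension table;
-- dbo.dept_replicated is a hypothetical name.
CREATE TABLE dbo.dept_replicated
(
    cust_id  integer     NOT NULL,
    dept_id  varchar(5)  NOT NULL,
    emp_id   varchar(5)  NOT NULL,
    dep_name varchar(10) NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = REPLICATE      -- a full copy is kept on every compute node
);
```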

Solution 1
0 2019-02-09 02:48:57
