
How to model a fact table

I'm about to create a data warehouse with facts and dimensions in a star schema.

The business questions I want to answer are typically these:

  • How much money did we sell for in Q1?
  • How much money did we sell for in Q1 to females?
  • How much money did we sell for in Q1 to females aged 30-35?
  • How much money did we sell for in Q1 to females aged 30-35 living in New York?
  • How much money did we sell for in the clothes category last year?
  • How much money did we sell for of the product blue jeans last year?
  • How much money did we sell for of the product blue jeans to males aged 40-42 living in Australia last year?

I am thinking of a date dimension with hourly granularity (specifying year, month, day, hour, quarter, name of day, name of month, etc.). I am also thinking of a product dimension and a user dimension.

I wonder whether these questions could be answered using a single fact table, or whether it's proper to create multiple fact tables. I am thinking of a table such as:

FactSales

DimDate - FK to a table containing information about the date (such as quarter, day of week, year, month, day)

DimProduct - FK to a table containing information about the product (such as product name)

DimUser - FK to a table containing information about the user (such as age, gender)

TotalSales - a SUM of all sales for that particular date, product and user.

Also, what if I would like to measure both the total sales (money) and the total number of sales? Would it be proper to create a new fact table with the same dimensions but using TotalNumberOfSales as the fact instead?

Thankful for all input I can get about this.

I think you are on the right track. All the questions above should be answerable using only one fact table covering the sales.

I think one should start out unaggregated and aggregate later if needed. Considering that one sale can contain multiple products and multiple items, I'd organize it as follows: one fact row for each product in the sale (typically lines on the invoice, so I'd call them "order lines" or "sale lines"), and maybe three counter attributes:

  • NumItems - number of items, i.e. 3 if the customer bought three of the same product.
  • NumLines - number of "order lines"; should always be 1. May be useful when aggregating data later (it is a big win to already have sum(NumLines) rather than count(*) in the SQL), or when adding correction items (NumLines = -1).
  • NumSales - a fractional number, so it can be summed up to yield the number of sales (i.e. 0.333 if the sale involves three different products and hence contains three order lines).
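As a rough illustration of the three counters above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `fact_sale_line`, the `sale_id` and `amount` columns, and the sample rows are all made up for the example; only the three counter columns come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_sale_line (
        sale_id   INTEGER,  -- hypothetical transaction identifier
        product   TEXT,
        num_items INTEGER,  -- NumItems: units bought on this line
        num_lines INTEGER,  -- NumLines: 1, or -1 for a correction line
        num_sales REAL,     -- NumSales: 1 / number of lines in the sale
        amount    REAL      -- hypothetical money column
    )
""")
rows = [
    # Sale 1 has three order lines, so each line carries 1/3 of a sale.
    (1, "blue jeans", 3, 1, 1 / 3, 150.0),
    (1, "t-shirt", 1, 1, 1 / 3, 20.0),
    (1, "socks", 2, 1, 1 / 3, 10.0),
    # Sale 2 has a single order line, so it carries the whole sale.
    (2, "blue jeans", 1, 1, 1.0, 50.0),
]
conn.executemany("INSERT INTO fact_sale_line VALUES (?, ?, ?, ?, ?, ?)", rows)

# Every total is a plain SUM over unaggregated order lines.
total_items, total_lines, total_sales, total_amount = conn.execute(
    "SELECT SUM(num_items), SUM(num_lines), SUM(num_sales), SUM(amount) "
    "FROM fact_sale_line"
).fetchone()
print(total_items, total_lines, round(total_sales, 6), total_amount)
```

Because NumSales is stored as a fraction, its sum is a floating-point value and should be rounded or compared with a tolerance rather than tested for exact equality.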

Now, one will have trouble getting the right count for, e.g., "number of sales involving black clothes". We had this problem at my previous workplace. I'm sure some "best practice" must exist for this; we more or less ended up introducing a SaleID (or TransactionID) in the fact table and doing count(distinct SaleID). That lacks elegance, but works.
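To make the counting problem concrete, here is a small sketch (again with Python's sqlite3; the table layout, the `colour` column and the sample rows are assumptions for illustration). Summing the fractional NumSales over a filtered subset of lines no longer gives a whole number of sales, while count(distinct SaleID) does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sale_line ("
    "sale_id INTEGER, product TEXT, colour TEXT, num_sales REAL)"
)
conn.executemany(
    "INSERT INTO fact_sale_line VALUES (?, ?, ?, ?)",
    [
        (1, "jeans", "black", 0.5),   # sale 1: one black line, one white line
        (1, "shirt", "white", 0.5),
        (2, "jacket", "black", 0.5),  # sale 2: two black lines
        (2, "hat", "black", 0.5),
        (3, "socks", "white", 1.0),   # sale 3: no black items at all
    ],
)

# Filtering on a line-level attribute breaks the fractional counter,
# but counting distinct sale ids still gives the number of sales.
fractional_sum, distinct_sales = conn.execute(
    "SELECT SUM(num_sales), COUNT(DISTINCT sale_id) "
    "FROM fact_sale_line WHERE colour = 'black'"
).fetchone()
print(fractional_sum, distinct_sales)
```

Two sales involve black clothes here, which is what the distinct count reports; the fractional sum only adds up correctly when every line of a sale passes the filter.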

In our setup we had several money attributes. The most important were one for the revenue (what's left of the income after paying the direct costs attributed to the items sold) and one for the turnover (the price paid by the customer for the item). Sales tax or VAT may add more complications. One can make do with a single money attribute and split each sale into multiple lines in the fact table, but I would rather recommend multiple money columns in the sales line fact table. Everything in the fact table was counted in a "base currency" (euros, in our case), and we had an exchange rate dimension to track the exact amounts.

I don't think it makes sense to have a date dimension containing the hour of the day. At my former job I kept my warehouse in Postgres, and I actually managed quite well without a date dimension at all. Although a date dimension is considered "best business practice", I found that for all our purposes we got much better performance by using standard Postgres date functions instead of dragging in a date dimension. I experimented quite a lot with it, and in the end I found the best option was to split date and time into two different attributes. (Time zones and daylight saving gave me quite some extra headaches...)

I agree with tobixen - you're on the right track.

I would recommend that you read Ralph Kimball's book "The Data Warehouse Toolkit", particularly the chapter on retail sales - it goes into depth about a sales fact.

The Date Dimension is like having a Calendar table - you can split based on quarters, fiscal months, and other date-related things that are specific to the business. I typically keep both the date key and a timestamp datatype, so we can do things with the Fiscal Calendar. I would actually have a separate Time dimension if you need the grain of the table at that level, with buckets for hours of the day, or minutes, etc. I doubt you need hourly, though.
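As a sketch of what such a calendar table could contain, the following Python function generates one row per day. The column names, the YYYYMMDD integer key, and the fiscal year starting in July are all assumptions chosen for illustration; a real fiscal calendar is business specific.

```python
from datetime import date, timedelta


def build_date_dimension(start, end, fiscal_year_start_month=7):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        # Months elapsed since the start of the (assumed) fiscal year: 0..11.
        fiscal_offset = (day.month - fiscal_year_start_month) % 12
        rows.append({
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "month_name": day.strftime("%B"),  # locale-dependent name
            "day_name": day.strftime("%A"),    # locale-dependent name
            "fiscal_quarter": fiscal_offset // 3 + 1,
        })
        day += timedelta(days=1)
    return rows


dim_date = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
print(len(dim_date), dim_date[0]["date_key"], dim_date[0]["fiscal_quarter"])
```

The rows can then be loaded into the warehouse once and joined to the fact table through `date_key`, while the fact row still keeps its own timestamp.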

Here's what I would do:

Declare the Grain of your fact table:

1 row per order line

Note how the grain doesn't contain anything that doesn't uniquely identify the row.

Dimensional Attributes of the order line:

Date
Time (if needed, and bucketed by hour/minute etc)
Product
Customer

Degenerate Dimensions of the order line (these are codes that are related to the transaction):

Order Number
Order Line Number

Some Sample Measures:

Item Price at time of Sale (optional, may be useful in some situations)
Discount Amount
Sale Dollars

This should answer all of those questions.

For the totals, a simple COUNT / SUM after filtering on the attributes of the dimensions should work fine.
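Putting the grain, dimensions and measures together, here is a minimal sketch of one of the questions from the original post ("Q1 sales to females aged 30-35 living in New York"), using Python's sqlite3. The table and column names and the sample data are assumptions; storing a plain `age` on the customer is a simplification, since age changes over time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, gender TEXT, age INTEGER, city TEXT);
    CREATE TABLE fact_order_line (
        date_key          INTEGER REFERENCES dim_date,
        product_key       INTEGER REFERENCES dim_product,
        customer_key      INTEGER REFERENCES dim_customer,
        order_number      INTEGER,  -- degenerate dimension
        order_line_number INTEGER,  -- degenerate dimension
        sale_dollars      REAL
    );
    INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240520, 2024, 2);
    INSERT INTO dim_product VALUES (1, 'blue jeans', 'clothes'), (2, 'mug', 'kitchen');
    INSERT INTO dim_customer VALUES (1, 'F', 32, 'New York'), (2, 'M', 41, 'Sydney');
    INSERT INTO fact_order_line VALUES
        (20240115, 1, 1, 1001, 1, 60.0),
        (20240115, 2, 1, 1001, 2, 12.5),
        (20240115, 1, 2, 1002, 1, 60.0),
        (20240520, 1, 1, 1003, 1, 55.0);
""")

# Filter on dimension attributes, then aggregate the fact rows.
query = """
    SELECT SUM(f.sale_dollars), COUNT(*), COUNT(DISTINCT f.order_number)
    FROM fact_order_line AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    WHERE d.year = 2024 AND d.quarter = 1
      AND c.gender = 'F' AND c.age BETWEEN 30 AND 35 AND c.city = 'New York'
"""
sale_dollars, order_lines, orders = conn.execute(query).fetchone()
print(sale_dollars, order_lines, orders)
```

The same single fact table serves both the money question (SUM of the measure) and the "number of sales" question (COUNT of lines or distinct order numbers), so a second fact table with the same dimensions should not be needed for that.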

Bear in mind that the customer dimension is one of the most difficult to model; Kimball devotes a whole chapter of his book to it.

